ECSE 507 Lecture Notes (presentation)


These are notes used for lecturing ECSE 507 during the Winter 2024 semester.
Please note that these notes constantly evolve; beware of typos/errors.
These notes are heavily influenced by Boyd and Vandenberghe’s excellent text Convex Optimization.
I appreciate being informed about typos/errors in these notes.

N.B.: This page is formatted to be projected; see here for the unformatted version (i.e., without excessive whitespace).



Introduction
General Problem We are interested in solving and studying minimization problems subject to constraints:

\begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & x \in D \end{cases},

where
  • x \in \mathbb{R}^n is the vector of problem parameters;
  • f_0:\mathbb{R}^n \to \mathbb{R} is the objective function and f_0(x) is usually interpreted as the cost of choosing x ;
  • D \subset\mathbb{R}^n is a constraint set often described “geometrically”.
Example 1. Design Problem Interpretation Let
  • x,y=scalar-valued design variables
    (e.g., dimensions of manufactured object, yaw and pitch of jet).
  • f_0(x,y) = penalty for choosing design (x,y)
    (e.g., cost in material, energy, time, deviation from desired path).
  • D = design specifications
    (i.e., allowable/possible values for (x,y)).
    E.g., D = \{ (x,y) : a<x<b, c<y<d\} specifies minimum and maximum design values.
Then the problem is to find optimal design values (x,y) which minimize cost f_0 and satisfy the design specifications (x,y) \in D.






Example 2. Minimize function over ellipse Let
f_0(x,y) = 1 - \frac{\left(x^{2}+y^{2}\right)}{2}
D = \{ (x,y) \in \mathbb{R}^2 : x^2 + 2 y^2 \leq 1 \} .
Then the problem

\begin{cases} \text{minimize} & 1-\frac{\left(x^{2}+y^{2}\right)}{2}\\ \text{subject to} & x^2 + 2 y^2 \leq 1 \end{cases}

is a familiar kind of purely geometric optimization problem.
It has two solutions: (x,y,f_0(x,y)) = (\pm 1, 0 , \frac{1}{2}) .
Many applied problems can look like this, but with an applied interpretation.















Problems With Structure General optimization problems can be numerically inefficient to solve or analytically difficult, unless f_0 and D have additional structure/properties.
Identifying nice structure/properties of problem \implies problem may become analytically solvable or numerically efficient.
Examples of nice structure:
linearity
f_0(x) = c^T x
D defined in terms of linear equalities/inequalities;
e.g., a_1^T x \geq 0, \ldots, a_m^T x \geq 0, or succinctly written A x \succeq 0.






convexity or quasiconvexity
E.g., f_0 is convex if f_0(tx+(1-t)y) \leq tf_0(x) + (1-t)f_0(y) for all 0<t<1 ,
and D is itself a convex set.






sparsity or other matrix structure
E.g., f_0(x) = x^T C x and D given by Ax \succeq 0
If C,A sparse
\implies many terms vanish and may be skipped
\implies improved computational efficiency
\begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 1\\ 0 & 1 & 0 & 0 & 0 & 0\\ 1 & 0 & 0 & 1 & 0 & 1\\ 0 & 1 & 0 & 1 & 0 & 0\\ 0 & 1 & 0 & 0 & 1 & 0 \end{bmatrix} ,\qquad \begin{bmatrix} 1&2&0&0\\ 2&1&0&0\\ 0&0&2&1\\ 0&0&1&2 \end{bmatrix}















Linear Programming Simplest case: f_0 is linear and D defined in terms of linear constraints.
Notation: if x, y \in \mathbb{R}^n , then

x \succeq y means x_i \geq y_i for i =1,\ldots, n .

c^T = transpose of the column vector c \in \mathbb{R}^n.
Linear program: given c \in \mathbb{R}^n, b \in \mathbb{R}^m, A \in \mathbb{R}^{m \times n}, solve

\begin{cases} \text{minimize}&c^T x\\ \text{subject to}& Ax \succeq b\\ &x \succeq 0 \end{cases}.

Thus f_0(x) = c^T x and D = \{x \in \mathbb{R}^n: Ax \succeq b, x \succeq 0\}.
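N.B.: in practice such LPs are handed to off-the-shelf solvers. Below is a minimal sketch using SciPy's linprog on a small made-up instance (the data c, A, b here are hypothetical, chosen only for illustration); linprog expects constraints in the form A_ub x \leq b_ub, so Ax \succeq b is rewritten as -Ax \preceq -b.

import numpy as np
from scipy.optimize import linprog

# Hypothetical data: minimize c^T x subject to A x >= b, x >= 0.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0],
              [1.0, -1.0]])
b = np.array([1.0, -2.0])

# linprog solves: minimize c^T x subject to A_ub x <= b_ub and bounds on x,
# so A x >= b becomes (-A) x <= (-b).
res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None), (0, None)])
print(res.x, res.fun)  # expect x = (1, 0) with optimal value 1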

Positives of linear programming:
  • Conceptually simple: relies heavily on linear algebra
  • There are classical numerical methods which are often very efficient.
  • If x^\star \in D is a local minimizer of f_0 on D, then it is automatically a global minimizer on D.
  • Can sometimes approximate smooth problems linearly; however, usually can only give “local” results.
    (E.g., \sin(x) \sim x for 0 \leq x \ll 1 .)


Shortcomings of linear programming:
  • Many applied problems are not linear.
  • Many problems may not even be (suitably) approximated by linear programs.
    E.g., the “barrier”

    I_{-}(x) =  \begin{cases} 0 & x <0\\ \infty & x\geq0 \end{cases}

    is better approximated by a “logarithmic barrier” of the form - c \log(-x) than by any linear function.















Convex Optimization Convex optimization problem: f_0 and D are convex.
This is the main focus of the course.

Positives of Convex Optimization:
  • Relatively conceptually simple.
  • Still often have efficient, albeit more sophisticated, numerical methods.
  • Many applied problems may be recast as or approximated by convex optimization problems.
  • If x^\star \in D is a local minimizer of f_0 on D, then it is automatically a global minimizer on D.


Shortcomings of Convex Optimization:
  • Problems may seriously fail to be analytically tractable or numerically efficient.
  • There exist nonlinear problems which cannot be approximated by convex problems.















  • Example (Least Squares) A standard and ubiquitous kind of convex optimization problem is the least squares problem.
    This problem takes the form:

    \begin{cases} \text{minimize} & \Vert Ax-b \Vert^2\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ &h_i(x) =0, i=1,\ldots,p \end{cases},

    where
    • \Vert \cdot \Vert is some norm
    • A \in \mathbb{R}^{m \times n} a matrix
    • b \in \mathbb{R}^m a fixed vector
    • f_0(x) = \Vert A x-b \Vert^2 is the (convex) objective function
    • f_i,h_i are convex.
    N.B.: will come back to this problem.
    Example: Distance from points to ellipse If \Vert x \Vert^2 = x_1^2 + \cdots + x_n^2, A is the identity matrix, and D is an ellipsoid, then the solution is the point in the ellipsoid closest to the point b.
    The image below depicts this situation with

    b = \begin{bmatrix}0\\2\end{bmatrix} and D = \{(x,y): (x-2)^2 + 2(y-2)^2 \leq 2 \} .

    Here, the optimal solution is (x,y) = (2-\sqrt{2},2) .
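    N.B.: a quick numerical sanity check of this solution, sketched with SciPy's general-purpose minimize (SLSQP is used since the problem has an inequality constraint, encoded as g(x,y) \geq 0):

    import numpy as np
    from scipy.optimize import minimize

    b = np.array([0.0, 2.0])

    # Feasible set (x-2)^2 + 2(y-2)^2 <= 2, encoded as g >= 0 for SLSQP.
    con = {"type": "ineq",
           "fun": lambda v: 2.0 - ((v[0] - 2.0)**2 + 2.0 * (v[1] - 2.0)**2)}

    # Minimize the squared Euclidean distance to b, starting from the center.
    res = minimize(lambda v: np.sum((v - b)**2), x0=np.array([2.0, 2.0]),
                   constraints=[con])
    print(res.x)  # approximately (2 - sqrt(2), 2) = (0.5858..., 2)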















    Optimal Control Let
    x(t),u(t),a(t),f(t) be functions \mathbb{R} \to \mathbb{R} .
    Assume x(t) evolves by

    \dot x(t) = a(t) x(t) + f(t) u(t).

    Here,
    • \dot x(t) is the time derivative of x(t) .
    • a(t),f(t) are assumed to be given;
    • we think of x(t) as being the state of some system at time t ;
    • we think of u(t) as an input we are allowed to choose to dictate the evolution of the state x(t) from its initial value x(0) ; i.e., u(t) “controls” the system;
    • When u(t)=u(x(t),t) , the system experiences “feedback.”
    • the goal: choose control law u(t) = u(x(t),t) so that x(t) is as “desirable” as possible.
    Example: Optimal Control Problems Optimal control problem: choose the “best” control u(t) which gives the “most” desirable x(t) .
    Typically “best” and “desirable” are determined by size/cost of u(t) and x(t) ; e.g., one may wish to minimize

    \int \left( u(t)^2 + x(t)^2 \right) dt.

    Therefore: problem is to solve (roughly speaking)

    \begin{cases} \text{minimize} & \int \left( u(t)^2 + x(t)^2 \right) dt\\ \text{subject to} & \dot x(t) = a(t) x(t) + f(t) u(t) \end{cases}

    This will be another focus of the course and we will see some optimal control problems can be recast as convex optimization problems.















    Rough Outline of Course
    Part 1: Basics of Convexity and Convex Optimization Problems.
    Part 2: Applications of Convex Optimization Problems.
    Part 3: Algorithms for Solving Convex Optimization Problems.
    Part 4: Topics in Optimal Control.















    Convex Geometry
    Convex Sets Convex set: a subset X \subset \mathbb{R}^n satisfying:

    for all x,y \in X and t \in [0,1], there holds tx + (1-t) y \in X.

    I.e., X contains all line segments whose endpoints belong to X.

    Examples: some standard convex sets.

    1. Closed or open polytopes in \mathbb{R}^n.
      E.g., the interior of a tetrahedron.
    2. Euclidean balls, ellipsoids.
    3. Linear subspaces and affine spaces (e.g., lines, planes).
    4. Given a norm \Vert \cdot \Vert on \mathbb{R}^n, the \Vert \cdot \Vert-ball

      \{ x \in \mathbb{R}^n : \Vert x - x_{0} \Vert \leq r \}

      with center x_0 \in \mathbb{R}^n and radius r>0 is a convex set.
      Recall: a norm satisfies
      • \Vert x+ y \Vert \leq \Vert x \Vert + \Vert y \Vert for all vectors x,y ;
      • \Vert c x \Vert = |c| \Vert x \Vert for all vectors x and scalars c ;
      • \Vert x \Vert=0 iff x =0.















    Affine Subsets Affine subset: A subset X \subset \mathbb{R}^n satisfying:

    For all x,y \in X and t \in \mathbb{R}, there holds tx + (1-t) y \in X.

    I.e., X contains all lines which pass through two distinct points in X.
    N.B.: An affine subset is just a translated linear subspace:
    “a linear space that’s forgotten its origin”.
    Example 1. Let X = \{ (x,y,0): x,y \in \mathbb{R} \} be the xy-plane in \mathbb{R}^3.
    Then any translation or rotation of X is an affine subset.
    Example 2. If A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m, then X=\{x \in \mathbb{R}^n: Ax = b\} is affine.
    (X is just a translate of \text{ker}\,A.)
    Example 3. Let

    A =  \begin{bmatrix} 0&0&0\\ 0&0&0\\ 0&0&1 \end{bmatrix}\qquad \text{ and } \qquad b =  \begin{bmatrix} 0\\0\\3 \end{bmatrix} .

    Then the solution set to Ax=b is

    \{(x,y,3): x,y \in\mathbb{R}\},

    which is just \text{ker}\, A translated by 3 in the z-direction:















    Cones Cone: A subset X \subset \mathbb{R}^n satisfying:

    For all x \in X and t \geq 0, there holds tx \in X.

    I.e., X contains all “positive” rays emanating from the origin and passing through any of its points.
    Proposition. X \subset \mathbb{R}^n is a convex cone iff for all x_1,x_2 \in X and \theta_1,\theta_2\geq0, there holds \theta_1 x_1 + \theta_2 x_2 \in X.
    Proof.
    Step 1. (\implies ) Suppose X is a convex cone and let x_1,x_2 \in X and \theta_1,\theta_2\geq0 be arbitrary.
    Want to show: \theta_1 x_1 + \theta_2 x_2 \in X .
    Step 2. Being conic implies x = \frac{\theta_1}{t} x_1 and y=\frac{\theta_2}{1-t}x_2 belong to X for all 0<t<1.
    Step 3. Being convex implies tx+ (1-t)y = \theta_1 x_1 + \theta_2 x_2 \in X, as desired.
    Step 4. (\impliedby ) Suppose X is such that \theta_1 x_1 + \theta_2 x_2 \in X for all x_1,x_2 \in X and \theta_1,\theta_2\geq0.
    Want to show: X is a convex cone.
    Step 5. X being conic follows from taking \theta_1\geq0 arbitrary and \theta_2 = 0.
    Step 6. Convexity follows from taking \theta_1 + \theta_2 = 1 with \theta_1,\theta_2 \geq0 and 0 \leq \theta_1 \leq 1 .
    Indeed: \theta_1 + \theta_2 = 1 and t := \theta_1 \implies t x_1 + (1-t) x_2 \in X for 0 \leq t \leq 1 since 1-t = \theta_2 .

    Examples.

    1. Hyperplanes \{ x \in \mathbb{R}^n : a^{T} x = 0\} with normal a \in \mathbb{R}^n,
      halfspaces \{ x \in \mathbb{R}^n : a^{T}x \leq 0 \},
      nonnegative orthants \{x \in \mathbb{R}^n: x \succeq 0\} are all convex cones.
      (Here, u \succeq v if u_i \geq v_i for i = 1,\ldots,n.)
    2. Given a norm \Vert\cdot\Vert on \mathbb{R}^n, the \Vert\cdot\Vert-norm cone is

      \{ (x,t) \in \mathbb{R}^{n+1} : \Vert x \Vert \leq t \},

      which is a convex cone in \mathbb{R}^{n+1}.
    3. See “positive semidefinite cone” below.















    Polyhedra Polyhedron: Any subset X \subset \mathbb{R}^n of the form

    X = \{ x \in \mathbb{R}^n : a_j^T x \leq b_j,\, j=1,\ldots,m,\; c_i^T x = d_i,\, i=1,\ldots,p \}

    given the vectors a_j,c_i \in \mathbb{R}^n and scalars b_j,d_i \in \mathbb{R}.
    Thus, X is a finite intersection of halfspaces and hyperplanes.
    N.B.: Introducing equality constraints can be used to reduce dimension.
    Example. The polyhedron below is given by the indicated system of inequalities:



    \begin{aligned} y-x& \leq0\\ -x-y&\leq -1\\ x &\leq 3\\ -x &\leq -1 \end{aligned}

    It is an easy exercise to rewrite the inequalities in the notation a_j^T x \leq b_j for suitable a_j,b_j .















    Positive Semidefiniteness
    Symmetric matrix: a matrix X \in \mathbb{R}^{n \times n} satisfying X = X^T ; i.e.,

    X = \begin{bmatrix}  x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{nn} \end{bmatrix}  = \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2}\\ \vdots & \vdots & \ddots & \vdots\\ x_{1n} & x_{2n} & \cdots & x_{nn} \end{bmatrix}  = X^T.

    Set of symmetric matrices:

    \boldsymbol{S}^n = \{ X \in \mathbb{R}^{n \times n}: X = X^T \}.






    Positive semidefinite matrix: X \in \boldsymbol{S}^n satisfying z^T X z \geq 0 for all z \in \mathbb{R}^n.
    Equivalently, X only has nonnegative eigenvalues.
    If X \in \boldsymbol{S}^n is positive semidefinite, then write X \succeq 0 .
    If X,Y \in \boldsymbol{S}^n and X-Y\succeq0 , then write X \succeq Y .
    Set of symmetric positive semidefinite matrices:

    \boldsymbol{S}_+^n = \{ X \in \boldsymbol{S}^n: X \succeq 0 \}.

    (N.B.: \succeq is not the same as component-wise inequality, as was the case for vectors.)




    Positive definite matrix: X \in \boldsymbol{S}_+^n satisfying z^T X z = 0 if and only if z=0.
    Equivalently, X only has positive eigenvalues.
    If X \in \boldsymbol{S}^n_+ is positive definite, then write X \succ 0 .
    If X,Y \in \boldsymbol{S}^n_+ and X-Y\succ0 , then write X \succ Y .
    Set of symmetric positive definite matrices:

    \boldsymbol{S}_{++}^n = \{ X \in \boldsymbol{S}^n: X \succ 0 \}.







    Example 1 Let A = \begin{bmatrix} 2 & 0 \\  0 & 4  \end{bmatrix} .
    Since A has positive eigenvalues \{2,4\} , it follows that A \succ 0 .
    To see A \succ 0 explicitly, observe

    \begin{aligned}  z^T A z &= \begin{bmatrix} z_1 & z_2 \end{bmatrix}  \begin{bmatrix} 2 & 0\\ 0 & 4 \end{bmatrix}  \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\\ &= 2 z_1^2 + 4 z_2^2\\ &\geq0 \end{aligned}


    with z^T A z = 0 iff z = \begin{bmatrix}0\\0\end{bmatrix} .



    Example 2. Let B = \begin{bmatrix} 4 & 0 \\ 0 & 0 \end{bmatrix}.
    Since B has nonnegative eigenvalues \{ 4,0 \} , it follows that B \succeq 0 .
    To see B \succeq 0 explicitly, observe

    \begin{aligned} z^TBz &= \begin{bmatrix}z_1 & z_2 \end{bmatrix} \begin{bmatrix} 4 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\\ &=4 z_1^2\\ &\geq 0. \end{aligned}

    Evidently, z^T B z=0 for z = \begin{bmatrix} 0\\z_2 \end{bmatrix} and so B \not\succ 0.



    Example 3. Let C =\begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix} \succ 0 .
    One can conclude C \succ0 by showing that C has positive eigenvalues \{ 5,1 \} .
    To see it directly, compute

    \begin{aligned} z^T C z &= \begin{bmatrix}z_1 & z_2 \end{bmatrix} \begin{bmatrix} 3&2\\2&3 \end{bmatrix} \begin{bmatrix} z_1 \\z_2 \end{bmatrix}\\ &= 3z_1^2 + 4z_1z_2 + 3z_2^2. \end{aligned}

    But the discriminant (with respect to z_1 ) of this quadratic satisfies -20 z_2^2 \leq 0 , from which we conclude the polynomial is positive unless z_1=z_2=0 and hence C \succ 0 .



    Example 4. Let D = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix} .
    We can conclude D \not\succeq 0 : either compute its eigenvalues \{3,-1\} , or observe that \det D = 1 - 4 = -3 < 0 , whence D cannot have only nonnegative eigenvalues.
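    N.B.: such hand checks are easily mirrored numerically; a minimal sketch using NumPy's eigvalsh (eigenvalues of a symmetric matrix) on the four examples above:

    import numpy as np

    # Classify the four example matrices by their eigenvalues.
    mats = {
        "A": np.array([[2.0, 0.0], [0.0, 4.0]]),
        "B": np.array([[4.0, 0.0], [0.0, 0.0]]),
        "C": np.array([[3.0, 2.0], [2.0, 3.0]]),
        "D": np.array([[1.0, 2.0], [2.0, 1.0]]),
    }
    for name, M in mats.items():
        w = np.linalg.eigvalsh(M)  # eigenvalues, ascending
        print(name, w, "PSD:", bool(np.all(w >= 0)), "PD:", bool(np.all(w > 0)))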















    Positive Semidefinite Cone Proposition 1. \boldsymbol{S}^n is a \frac{n(n+1)}{2}-dimensional real vector space and \boldsymbol{S}_+^n is a convex cone in \boldsymbol{S}^n.
    Proof.
    Step 1. \boldsymbol{S}^n is a vector space: if X,Y \in \boldsymbol{S}^n and c \in \mathbb{R}, then it is easy to see:

    (X+cY)^T = X^T + cY^T = X + cY.

    and so X+cY \in \boldsymbol{S}^n.


    Step 2. \text{dim}\,\boldsymbol{S}^n = \frac{n(n+1)}{2}: since X \in \boldsymbol{S}^n implies X=X^T, we have the identification

    \begin{aligned} X&= \begin{bmatrix} \boldsymbol{x_{11}} & \boldsymbol{x_{12}} &  \boldsymbol{x_{13}} & \cdots & \boldsymbol{x_{1n}}\\ x_{12} & \boldsymbol{x_{22}} & \boldsymbol{x_{23}} &\cdots & \boldsymbol{x_{2n}}\\ x_{13} & x_{23} & \boldsymbol{x_{33}} &\cdots & \boldsymbol{x_{3n}}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ x_{1n} & x_{2n} & x_{3n} & \cdots & \boldsymbol{x_{nn}}\\ \end{bmatrix}\\ &\qquad\qquad\qquad\qquad\iff\\ \xi&:= \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} & x_{22} & x_{23} & \cdots & x_{2n} & \cdots & x_{nn} \end{bmatrix}^T, \end{aligned}

    where the bolded entries in X indicate the unique contributions to making \xi.
    Counting the number of bold entries shows \xi has \frac{n(n+1)}{2} entries and hence \xi \in \mathbb{R}^{\frac{n(n+1)}{2}}.


    Step 3. \boldsymbol{S}_+^n is a convex cone: For \theta_1,\theta_2\geq0, X,Y \in \boldsymbol{S}_+^n and z \in \mathbb{R}^n there holds

    z^T(\theta_1 X +\theta_2 Y)z = \theta_1 z^T X z + \theta_2 z^T Y z \geq 0,

    and so \theta_1 X + \theta_2 Y \in \boldsymbol{S}_+^n .
    By the proposition in Convex Geometry.Cones, we conclude the desired result.










    Proposition 2. \begin{bmatrix} a&b\\b&c \end{bmatrix} \in \boldsymbol{S}_+^2 iff

    a,c\geq0 and \det \begin{bmatrix} a&b\\b&c \end{bmatrix} = ac - b^2 \geq 0.

    Proof.
    Step 1. Let

    x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \text{and} \quad  X = \begin{bmatrix} a&b\\b&c \end{bmatrix},


    recalling that X \in \boldsymbol{S}_+^2 iff x^T X x \geq 0 for all x \in \mathbb{R}^2.

    Step 2. (Case a=0 ) First compute

    x^T X x = \begin{bmatrix}x_1 & x_2 \end{bmatrix} \begin{bmatrix} 0&b\\b&c \end{bmatrix}\begin{bmatrix} x_1\\x_2\end{bmatrix} = 2bx_1x_2+cx_2^2.

    Observe that

    2bx_1x_2 + cx_2^2 \geq 0 for all x_1,x_2 \in \mathbb{R} iff b=0,c\geq0 .

    Note (in case a=0 ): ac - b^2 = 0 iff b=0 .
    Can thus conclude (in case a=0 ):

    x^TXx \geq 0 for all x \in \mathbb{R}^2 iff c\geq0, \det X =0 .



    Step 3. (Case a \neq 0 ) Completing the square gives

    \begin{aligned}  x^T X x &= \begin{bmatrix}x_1&x_2\end{bmatrix}\begin{bmatrix}a&b\\b&c\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix}\\ &=ax_1^2 + 2bx_1 x_2 + c x_2^2 \\ &= a(x_1 + a^{-1}bx_2)^2 + a^{-1}\det\, X\, x_2^2. \end{aligned}

    But

    a(x_1 + a^{-1}bx_2)^2 + a^{-1}\det\, X\, x_2^2 \geq0 for all x_1,x_2 \in \mathbb{R}

    iff

    a>0 and \det\,X \geq0 .

    (To conclude a>0 , take x_2=0.)
    N.B.: strictly speaking, c\geq0 was not used anywhere; however, X \succeq0 immediately implies c \geq0 , and a>0,ac-b^2 \geq0 also implies c\geq0 .

    Step 4. Putting Steps 2. and 3. together, we conclude:
    \begin{bmatrix} a&b\\b&c \end{bmatrix} \in \boldsymbol{S}_+^2 iff

    a,c\geq0 and \det \begin{bmatrix} a&b\\b&c \end{bmatrix} = ac - b^2 \geq 0.
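    N.B.: a small randomized sanity check of Proposition 2, comparing the determinant criterion against numerically computed eigenvalues (a tolerance is needed since the eigenvalues are computed in floating point):

    import numpy as np

    def psd2(a, b, c):
        # Proposition 2: [[a, b], [b, c]] lies in S_+^2 iff a, c >= 0 and ac - b^2 >= 0.
        return a >= 0 and c >= 0 and a * c - b * b >= 0

    rng = np.random.default_rng(0)
    mismatches = 0
    for _ in range(100_000):
        a, b, c = rng.uniform(-2, 2, size=3)
        w = np.linalg.eigvalsh(np.array([[a, b], [b, c]]))
        if psd2(a, b, c) != bool(w.min() >= -1e-9):
            mismatches += 1
    print("mismatches:", mismatches)  # expect 0, up to numerical tolerance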











    Image of \boldsymbol{S}_+^2 In light of Proposition 1 and Proposition 2, we can plot

    \boldsymbol{S}_+^2 \subset \boldsymbol{S}^2 \cong \mathbb{R}^3.

    The image below depicts the boundary of

    \boldsymbol{S}_+^2 \cong \{ (x,y,z): xz \geq y^2 ,x\geq0,z\geq0\}.
















    Separating Hyperplanes Let A,B \subset \mathbb{R}^n be two sets.
    Separating hyperplane: a hyperplane given by

    P_{a,b} := \{ x : a^T x = b \}

    for some a \in \mathbb{R}^n and b \in \mathbb{R} such that

    \begin{aligned} a^Tx -b \geq 0 & \text{ on } A\\ a^Tx - b \leq 0 & \text{ on } B \end{aligned}.

    P_{a,b} is said to separate A and B .
    Thus, P_{a,b} cuts \mathbb{R}^n into two halfspaces with one containing all of A and the other containing all of B .
    Separating Hyperplane Theorem. If A,B \subset \mathbb{R}^n are two disjoint convex sets, then there exists a \in \mathbb{R}^n and b \in \mathbb{R} such that P_{a,b} is a separating hyperplane which separates A and B .
    Example 1. Consider the convex sets

    \begin{aligned} A &= \{(x,y) : \left(x-1\right)^{2}+2\left(y-1\right)^{2}\leq1 \}\\ B &= \{(x,y) : \left(x+1\right)^{2}+\left(y+1\right)^{2}\leq1\}\\ C &= \{(x,y) : \left(x-2\right)^{2}+2\left(y-1\right)^{2}\leq1 \}\\ P:&=P_{(1,1),0} = \{(x,y) : y+x=0 \} \end{aligned}.

    These three sets are indicated in the image below.
    Note that P separates the pairs [B,A] and [B,C] . Moreover, the pair [A,C] cannot be separated since A and C have significant overlap.
    Example 2. Consider the convex sets

    \begin{aligned} A &= \{(x,y) : x^{2}+y^{2}<1 \}\\ B &= \{(x,y) : \left(x+2\right)^{2}+y^{2}<1 \}\\ P:&=P_{(1,0),-1} = \{(x,y) : x=-1 \} \end{aligned}.

    These sets are indicated in the image below.
    First note A \cap B = \emptyset since neither set contains its boundary.
    As such, they have a separating hyperplane which is given by P .

    N.B.: Replacing A,B with their respective closures \overline{A},\overline{B} , the plane P still separates \overline{A},\overline{B} .
    Indeed, x+1 \geq 0 for (x,y) \in A and x+1 \leq 0 for (x,y) \in B .















    Supporting Hyperplanes Let A \subset \mathbb{R}^n be a fixed set and fix a boundary point

    x_0 \in \text{bd}A := \overline{A} \setminus \text{int}A .

    If the plane

    P_{a,a^Tx_0} = \{ x : a^Tx = a^Tx_0 \}

    separates A and the singleton \{x_0\} , then P_{a,a^Tx_0} is called a supporting hyperplane of A at x_0 .
    Equivalently, A lies entirely in a halfspace with boundary given by P_{a,a^Tx_0} .

    (Here: \overline{A} indicates the closure of A and \text{int} A indicates its interior.)

    Example. Consider the convex sets

    \begin{aligned} A &= \{ (x,y) : (x-2)^2 + (y-2)^2 \leq 1 \}\\ P :&= P_{(-1,0),-1} = \{ (x,y) : x = 1 \} \end{aligned}

    with boundary point x_0 = (1,2) \in \partial A .
    Letting a = \begin{bmatrix} -1\\0\end{bmatrix} , we note a^T x_0 + 1 = 0 \geq 0 .
    Next, observe that if (x,y) \in A , then x \geq 1 , and so

    a^Tx + 1 = \begin{bmatrix} -1&0 \end{bmatrix} \begin{bmatrix} x\\y \end{bmatrix} +1 = -x +1 \leq 0.

    Thus P separates A and \{ x_0 \} , showing that P is a supporting hyperplane of A at the boundary point x_0 ; see image below.















    Hulls Let X \subset \mathbb{R}^n be a fixed subset.
    Convex hull: the set

    \{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in X,\, \theta_i \geq 0,\, i = 1,\ldots,k,\, \theta_1 + \cdots + \theta_k= 1\}.

    This is just the collection of all convex combinations of points in X and is itself convex.
    Example: The images below depict a set of three points and its convex hull.
    3 points in the plane
    Convex hull of 3 points


    Affine hull: the set

    \{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in X,\, i = 1,\ldots,k,\, \theta_1 + \cdots + \theta_k= 1\}.

    This is just the collection of all affine combinations of points in X and is itself affine.
    Example: The images below depict two points and their affine hull.
    Two points in the plane
      
    Affine hull of two points


    Conic hull: The set

    \{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in X,\, \theta_i \geq 0,\, i = 1,\ldots,k\}.

    This is just the collection of all conic combinations of points in X and is itself a cone.
    Example: The images below depict two points x_1 = (0.5,1) , x_2 = (1.5,0.5) and their conic hull.
    Two points in the plane
      
    Conic hull of two points
    Details To see that the conic hull really is the shaded region, note that, by taking \theta_1 = t s_1 and \theta_2 = (1-t)s_2 , where t \in [0,1] and s_1,s_2\geq0 , the conic hull contains all points of the form ts_1x_1 + (1-t)s_2x_2 .
    Thus, it contains all line segments connecting any two points on the nonnegative rays \{ s x_1 : s \geq 0 \} and \{ s x_2 : s \geq 0 \} .






    N.B.:
    1. Conic hulls are convex cones.
    2. Taking the “___ hull” of X does indeed result in a “___” set.
    3. The “___ hull” is a construction of the smallest “___” subset containing X.















    Generalized Inequalities Proper cone: a convex cone K \subset \mathbb{R}^n satisfying
    1. K is closed (i.e., K contains its boundary)
    2. K has nonempty interior
    3. x,-x \in K \implies x=0 .
    Generalized inequality: given a proper cone K , a partial ordering \preceq_K on \mathbb{R}^n defined by

    x \preceq_K y \iff y-x \in K .

    N.B.: \preceq_K is only a partial ordering, so not every pair x,y is comparable.
    Generalized strict inequality: given a proper cone K , the relation \prec_K on \mathbb{R}^n defined by

    x \prec_K y \iff y-x \in \text{int}\,K .

    Examples
    1. (CO Example 2.14)
      If K = \mathbb{R}^n_+ , then \preceq_K is the standard componentwise vector inequality:

      v \preceq w \iff v_i \leq w_i, \, i = 1,\ldots,n .

      N.B.: \preceq_{\mathbb{R}_+} is the standard inequality on \mathbb{R} .
    2. (CO Example 2.15)
      If K = \boldsymbol{S}_+^n , then

      \begin{aligned} A &\preceq_K B \iff B-A \text{ is positive semidefinite}\\ A &\prec_K B \iff B-A \text{ is positive definite} \end{aligned} .

    3. Let

      K = \{(x_1,x_2) : x_1 \leq 2x_2, x_2 \leq 2x_1 \} .

      Then K is a proper cone.
      In the image below:
      • K is the cone with vertex (0,0) .
      • The cone with vertex x = (-1,1) depicts those y \in \mathbb{R}^2 with x \preceq_K y .
      • The cone with vertex x = (1,-1) depicts those y \in \mathbb{R}^2 with x \preceq_K y .
      N.B.: (0,2)-(-1,1) = (1,1) \in K , and so (1,1) \preceq_K (0,2) , as indicated in the image.
      Moreover, (-1,1) and (1,-1) are not comparable.















    Convex Function Theory
    Conventions and Notations
    1. Writing f:\mathbb{R}^n \to \mathbb{R} always means a partial function with domain \text{dom}\,f possibly smaller than \mathbb{R}^n.
      “Function” will mean “partial function.”
    2. If \text{dom}\, f \neq \mathbb{R}^n, we may work with the extension \tilde{f}:\mathbb{R}^n \to \mathbb{R} \cup \{ +\infty \} given by

      \tilde{f}(x) = \begin{cases} f(x) & x \in \text{dom}\, f\\ +\infty & x \notin \text{dom}\, f \end{cases}.

      It is common to implicitly assume f has been extended and to write f for the partial function f and its extension \tilde{f}.
    3. Given a set C \subset \mathbb{R}^n, its indicator function is

      \tilde{I}_C(x) = \begin{cases} 0 & x \in C\\ +\infty & x \notin C \end{cases}.

    4. We write

      \begin{aligned} \mathbb{R}_+ &:= \{ x \in \mathbb{R}: x\geq0 \}\\ \mathbb{R}_{++} &:= \{ x \in \mathbb{R}: x > 0 \} \end{aligned}.
















    Convex Functions Let f:\mathbb{R}^n \to \mathbb{R} be a function with convex domain \text{dom}\, f.
    Convexity: for all x,y \in \text{dom}\, f, t \in [0,1] there holds

    f(t x + (1-t)y) \leq t f(x) + (1-t)f(y).

    (This inequality is often called Jensen’s inequality.)



    Strict convexity: for all x,y \in \text{dom}f,\,  x\neq y, t \in (0,1) there holds

    f(t x + (1-t)y) < t f(x) + (1-t)f(y).


    Example: failure of strict convexity In the figure, the solid line indicates part of the graph of x^4 and the dashed line indicates part of the graph of a linear function.
    The linear function fails to be strictly convex since linear functions satisfy

    f(tx + (1-t)y) = tf(x) + (1-t)f(y) .





    Concavity and strict concavity: when -f is, respectively, convex and strictly convex.







    Remarks.
    1. It is instructive to compare convexity/concavity with linearity and view the former as weak versions of linearity.

    2. It is common to extend the definition of convexity to extended functions, i.e., those of the form f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\} .

      For example, the indicator function \tilde{I}_{(-\infty,2)} is convex in this sense.
      To give insight, consider the image below, where the thick line is the “graph” of \tilde{I}_{(-\infty,2)} and the dashed line is the “secant line” connecting the points (1,0) to (x,\infty) for any x\geq 2 .
















    Examples
    1. All linear functions are convex and concave on their domains.
    2. e^x is convex on \mathbb{R}.
    3. |x|^p is convex on \mathbb{R} for p \geq 1 .
    4. x^p is convex on \mathbb{R}_{++} for p \geq 1 or p \leq 0 and concave for 0 \leq p \leq 1.
    5. - \log\det X is convex on \boldsymbol{S}_{++}^n .
    6. If C \subset \mathbb{R}^n is convex, then its indicator function \tilde{I}_C is convex (in the extended value sense).
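    N.B.: claims like these are easy to spot-check numerically; a minimal randomized test of Jensen's inequality for |x|^p with p=3 (assuming NumPy; the small 1e-9 slack absorbs floating point error):

    import numpy as np

    # Randomized check of f(tx + (1-t)y) <= t f(x) + (1-t) f(y) for f(x) = |x|^3.
    rng = np.random.default_rng(0)
    f = lambda x: np.abs(x)**3
    x, y = rng.uniform(-5, 5, size=(2, 100_000))
    t = rng.uniform(0, 1, size=100_000)
    print(bool(np.all(f(t*x + (1 - t)*y) <= t*f(x) + (1 - t)*f(y) + 1e-9)))  # True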















    One Dimensional Characterization Proposition. Let f:\mathbb{R}^n \to \mathbb{R} have convex domain and, given

    x \in \text{dom}\, f and v \in \mathbb{R}^n ,

    define the function g:\mathbb{R} \to \mathbb{R} by

    g(t) = f(x+tv)

    with

    \text{dom}\,g := \{ t \in \mathbb{R} : x + tv \in \text{dom}\, f \}.

    Then f is convex iff g is convex for all x \in \text{dom}\,f and v \in \mathbb{R}^n such that g is well-defined.
    Proof.
    Step 1. First note that \text{dom}\,g is convex: it parametrizes the intersection of \text{dom}\,f with the line through x in direction v , and so is an interval.



    Step 2. (\implies ) Suppose f is convex and let x \in \text{dom}\,f and v \in \mathbb{R}^n be arbitrary.
    Then, for \theta \in [0,1] and t_1,t_2 \in \text{dom}\, g, there holds

    \begin{aligned} g(\theta t_1 + (1-\theta) t_2) &= f(x + (\theta t_1 + (1-\theta)t_2)v)\\ &= f(x + \theta t_1 v + (1-\theta) t_2 v)\\ &= f(\theta(x + t_1 v) + (1-\theta)(x+t_2 v) ) \\ &\leq \theta f(x+t_1 v) + (1-\theta)f(x + t_2 v)\\ &= \theta g(t_1) + (1-\theta)g(t_2), \end{aligned}

    proving that g is convex.



    Step 3. (\impliedby ) Suppose now that each g is convex.
    Fix x,y \in \text{dom}\,f and let \theta \in [0,1] .
    We want to show

    f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) .

    Let v = y-x and

    g(t) = f(x + t(y-x)),

    noting that g(0)=f(x), g(1)=f(y).
    Since g is convex, we conclude

    \begin{aligned} f(\theta x + (1-\theta)y) &= f(x + (1-\theta)(y-x))\\ &= g(1-\theta)\\ &=g(\theta \cdot 0 + (1-\theta)\cdot 1)\\ &\leq \theta g(0) + (1-\theta) g(1)\\ &=\theta f(x) + (1-\theta) f(y). \end{aligned}

    This is enough to conclude f is convex.















    First Order Characterization Proposition. If f:\mathbb{R}^n \to \mathbb{R} is differentiable with convex domain \text{dom}\, f, then f is convex iff

    f(x) + \nabla f(x)^{T}(y-x) \leq f(y), \quad \forall x,y \in \text{dom}\, f.


    Proof (sketch). We prove it in case f:\mathbb{R} \to \mathbb{R} ; the higher dimensional case follows by using that f:\mathbb{R}^n \to \mathbb{R} with convex domain is convex iff it is convex as a single variable function when restricted to lines intersecting \text{dom}\,f .
    Throughout, let x,y \in \text{dom}\,f and t \in [0,1] .
    Step 1. (\implies ) If f is convex, then we obtain the following inequalities

    \begin{aligned} f(ty + (1-t)x) \leq tf(y) + (1-t)f(x)\quad&\text{(convexity)}\\ f(x) + \frac{f(x+t(y-x))-f(x)}{t}  \leq f(y)\quad&\text{(rearranging)}\\ f(x) + \frac{f(x+t(y-x))-f(x)}{t(y-x)}(y-x)  \leq f(y)\quad &\text{(rearranging)}\\ f(x) + f'(x)(y-x)  \leq f(y)\quad&\text{(taking }t \to 0 \text{)}. \end{aligned}





    Step 2. (\impliedby ) Supposing

    f(x) + f'(x)(y-x) \leq f(y)

    we set z = tx + (1-t)y and add the two inequalities

    \begin{aligned} t f(z) + tf'(z)(x - z) &\leq tf(x)\\ (1-t) f(z) + (1-t)f'(z)(y - z) &\leq (1-t)f(y) \end{aligned}

    to obtain

    f(tx+(1-t)y)=f(z) \leq tf(x) + (1-t)f(y).








    Remarks.
    1. For fixed x \in \text{dom}\,f, the mapping

      A_x: \, y \mapsto f(x) + \nabla f(x)^{T}(y-x)

      is affine, and its graph is a hyperplane passing through the point (x,f(x)) .
      Therefore, the inequality A_x(y) \leq f(y) means this hyperplane is a tangent plane at (x,f(x)) of the graph of f lying under the graph of f .
      In fact, this plane is a supporting hyperplane of the epigraph

      \text{epi}\,f := \{ (x,t) \in \mathbb{R}^{n} \times \mathbb{R} : x \in \text{dom}\,f, t \geq f(x)\}

      at the point (x,f(x)) .


    2. The affine mapping A_x is just the first order Taylor approximation of f at x .
      Thus, differentiable convex functions are such that their first order Taylor approximations serve as global underestimators of f .

    Example. In the image below:
    • solid line is the graph of f(x) = e^x ;
    • shaded region is the convex set given by \text{epi}\,f = \{ (x,t): t\geq e^x \} ;
    • dashed line is the supporting hyperplane at (-1,e^{-1}) given by the graph of e^{-1} + e^{-1}(x+1) .


    f(x) + f'(x)(y-x) gives a supporting hyperplane
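    N.B.: this picture is easy to verify on a grid; a minimal sketch for the example above (tangent to f(x) = e^x at x = -1), assuming NumPy:

    import numpy as np

    # The tangent line at x0 = -1 should underestimate e^x everywhere.
    x0 = -1.0
    xs = np.linspace(-4.0, 4.0, 2001)
    tangent = np.exp(x0) + np.exp(x0) * (xs - x0)  # f(x0) + f'(x0)(x - x0)
    print(bool(np.all(tangent <= np.exp(xs))))     # True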















    Second Order Characterization Proposition. If f is twice-differentiable with \text{dom}\,f convex, then f is convex iff

    \nabla^2 f(x) \succeq 0, \quad \forall x \in \text{dom}\, f.


    Recall: if f: \mathbb{R}^n \to \mathbb{R} is twice-differentiable, then its Hessian is

    \nabla^2 f(x) =  \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1 }(x) & \frac{\partial^2 f}{\partial x_2^2 }(x) & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n \partial x_1}(x) & \frac{\partial^2 f}{\partial x_n \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x) \\ \end{bmatrix} \in \mathbb{R}^{n \times n}.


    Proof (sketch).
    Step 0. The proof is a little more involved, so let us just give two intuitive justifications.



    Justification 1. The second order Taylor approximation gives

    f(y) = f(x) + \nabla f(x)^T (y-x)  + \frac{1}{2} (y-x)^T \nabla^2 f(x)(y-x)

    up to some small error. But

    \nabla^2 f(x) \succeq 0

    implies

    (y-x)^T \nabla^2 f(x)(y-x) \geq0

    and so

    f(y) \geq f(x) + \nabla f(x)^T (y-x)

    (again, up to some small error).
    The first order approximation from Convex Function Theory.First Order Characterization then implies convexity.



    Justification 2. Another intuitive justification is that \nabla^2 f(x) \succeq 0 means the graph of f curves everywhere upward like a paraboloid, which evidently suggests convexity.





    Remarks.
    1. Recall: for x \in \text{dom}\,f , there holds

      \begin{aligned} \nabla^2 f(x) \succeq 0 & \iff \nabla^2 f(x) \text{ is positive semidefinite}\\ & \iff \nabla^2 f(x) \in \boldsymbol{S}_+^n. \end{aligned}

    2. \nabla^2 f (x) \succ 0 for all x \in \text{dom}\, f implies f is strictly convex.
      Converse is false: x^4 is strictly convex, yet its second derivative 12x^2 vanishes at x=0 .
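    N.B.: a minimal numerical illustration of the second order test for f(x,y) = x^2 + e^{y^2} (this function reappears in the level set figures below); its Hessian works out to \text{diag}(2, (2+4y^2)e^{y^2}) , which we check on a grid:

    import numpy as np

    # Hessian of f(x, y) = x^2 + exp(y^2): diag(2, (2 + 4 y^2) exp(y^2)).
    def hessian(x, y):
        return np.array([[2.0, 0.0],
                         [0.0, (2.0 + 4.0 * y**2) * np.exp(y**2)]])

    ok = all(np.all(np.linalg.eigvalsh(hessian(x, y)) >= 0)
             for x in np.linspace(-3, 3, 31) for y in np.linspace(-3, 3, 31))
    print(ok)  # True: Hessian PSD at every grid point, consistent with convexity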
















    Level Sets Fix a function f: \mathbb{R}^n \to \mathbb{R} and let c \in \mathbb{R} .
    c-Level set: the set

    S(c):=\{ x \in \text{dom}\, f : f(x) = c\}.

    The figure below depicts level sets of x^2 + e^{y^2} with c=5,10,17,26.


    c-Sublevel set: the set

    S_c = \{ x \in \text{dom}\,f : f(x) \leq c\}.

    The figure below depicts the sublevel sets of x^2+e^{y^2} with c=5,10,17,26.
    Each shade of gray indicates a new sublevel set and of course S_c \subset S_{c'} for c<c' .


    c-Superlevel set: the set

    S^c = \{ x \in \text{dom}\,f : f(x) \geq c \}.

    The figure below depicts the superlevel sets of x^2+e^{y^2} with c=5,10,17,26.
    Each shade of gray indicates a new superlevel set and of course S^c \supset S^{c'} for c<c' .
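    N.B.: figures like the three above can be reproduced in a few lines of Matplotlib; a sketch for x^2 + e^{y^2} at the levels c = 5, 10, 17, 26 (assuming NumPy/Matplotlib):

    import numpy as np
    import matplotlib.pyplot as plt

    # Sample f(x, y) = x^2 + exp(y^2) on a grid.
    x = np.linspace(-6, 6, 400)
    y = np.linspace(-2, 2, 400)
    X, Y = np.meshgrid(x, y)
    F = X**2 + np.exp(Y**2)

    levels = [5, 10, 17, 26]
    plt.contour(X, Y, F, levels=levels)                          # level sets S(c)
    plt.contourf(X, Y, F, levels=[F.min()] + levels, alpha=0.3)  # sublevel sets S_c
    plt.show()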






    Proposition. If f is convex, then the sublevel set S_c is convex for all c \in \mathbb{R}.
    Equivalently, if f is concave, then the superlevel set S^c is convex for all c \in \mathbb{R} .


    Proof. Want to show: x,y \in S_c implies tx + (1-t)y \in S_c for all t \in [0,1] .
    If x,y \in S_c , then f(x),f(y) \leq c and so convexity of f gives

    \begin{aligned}  f(tx+(1-t)y) &\leq t f(x) + (1-t)f(y) \\ &\leq t c + (1-t)c \\ &= c  \end{aligned}


    and hence tx + (1-t)y \in S_c as desired.















    Graphs Fix a function f: \mathbb{R}^n \to \mathbb{R}.
    Graph: the set

    \{ (x,f(x)):x \in \text{dom}\,f \} \subset \mathbb{R}^n \times \mathbb{R}.

    Example: graph of e^x is given below.


    Epigraph: the set

    \text{epi}\, f = \{ (x,t): x \in \text{dom}\, f, f(x) \leq t \} \subset \mathbb{R}^n \times \mathbb{R}.

    Example: epigraph of e^x is given below.


    Hypograph: the set

    \text{hypo}\, f = \{ (x,t) : x \in \text{dom}\, f, f(x) \geq t \} \subset \mathbb{R}^n \times \mathbb{R}.

    Example: hypograph of e^x is given below.






    Proposition. f is convex iff \text{epi}\, f is convex.
    Equivalently, f is concave iff \text{hypo}\, f is convex.
    Proof. (sketch) We consider the case f:\mathbb{R} \to \mathbb{R} for simplicity.
    Step 1. (\implies ) Suppose f is convex and let x,y \in \text{epi}\,f be distinct points.
    If x,y both lie on a vertical line, then clearly tx + (1-t)y \in \text{epi}\,f for t \in [0,1] ; thus, suppose otherwise.
    Let \ell be the line passing through x,y and let x',y' be the two intersection points of \ell with the graph of f .
    (If at most one intersection point exists, then it is easy to see that the segment connecting x and y is in \text{epi}\,f .)
    By convexity of f , the segment formed by tx' + (1-t)y' for t \in [0,1] lies in \text{epi}\,f , which is enough to conclude the segment given by tx + (1-t)y for t \in [0,1] lies in \text{epi}\,f.
    This shows \text{epi}\, f is convex.
    Step 2. (\impliedby ) Suppose now \text{epi}\,f is convex.
    Let x,y be two distinct points on the graph of f .
    Then x,y \in \text{epi}\, f.
    But convexity of \text{epi}\,f implies the segment formed by tx+(1-t)y for t \in [0,1] lies entirely in \text{epi}\,f.
    This is enough to conclude f is convex.















    Convex Calculus The following list details some operations and actions that preserve convexity.
    The main point: to conclude a function f is convex, one often verifies that f may be built from other convex functions using, for example, the operations below.
    N.B.: Conclusions only hold on common domains of the functions.
    Conical combinations:

    f_1,\ldots,f_m convex and c_1,\ldots,c_m \geq0

    \implies

    c_1 f_1 + \cdots + c_m f_m convex.





    Weighted averages:

    f(x,y) convex in x, w(y) \geq0 \implies \int f(x,y) w(y) dy convex.





    Affine change of variables:

    f:\mathbb{R}^n \to \mathbb{R} convex, A \in \mathbb{R}^{n\times m}, b \in \mathbb{R}^n

    \implies

    f(A x + b) convex.





    Maximum:

    f_1,\ldots,f_m convex

    \implies

    f(x) := \max\{f_1(x),\ldots,f_m(x)\} convex.





    Supremum:

    f(x,a) convex in x for each a \in \mathcal{A}

    \implies

    h(x):=\sup\{f(x,a):a \in \mathcal{A}\} convex.



    Justification. For t \in [0,1] , there holds

    \begin{aligned}  h(tx + (1-t)y) &= \sup \{ f(tx + (1-t)y,a) : a \in \mathcal{A} \}\\ &\leq \sup \{ tf(x,a) + (1-t) f(y,a) : a \in \mathcal{A}\}\\ &\leq t \sup\{ f(x,a) :a \in \mathcal{A}\} + (1-t)\sup\{f(y,a):a \in \mathcal{A} \}\\ &=th(x) + (1-t)h(y). \end{aligned}



    Example. Let

    \begin{aligned} g(x)&:\mathbb{R}^n \to \mathbb{R} \text{ be given}\\ f(x,y)&:= y^T x - g(x). \end{aligned}

    N.B.: for each x , the mapping y \mapsto f(x,y) is affine and hence convex.
    Thus

    h(y):=\sup\{y^Tx - g(x) : x \in \text{dom}\,g \}

    defines a convex function.




    Infimum:

    f:\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R} convex in (x,y)

    C \subset \mathbb{R}^m convex

    \inf_{y \in C}f(x,y) finite for some x

    \implies

    g(x):=\inf_{y \in C}f(x,y) convex on

    \text{dom}\,g := \{ x : (x,y) \in \text{dom}\, f \text{ for some } y \in C\}.




















    Fenchel conjugation Let f:\mathbb{R}^n \to \mathbb{R} be given (not necessarily convex).
    Fenchel conjugate: f^*(y) = \sup \{ y^T x - f(x) : x \in \text{dom}\,f\}.

    N.B.:

    \text{dom}\, f^* = \{ y \in \mathbb{R}^n : f^*(y) < \infty \} ;

    i.e., those y \in \mathbb{R}^n for which y^Tx - f(x) is bounded above on \text{dom}\,f as a function of x .

    Intuition. Suppose f: \mathbb{R}_+ \to \mathbb{R}_+ is a differentiable convex function denoting the cost to produce x items.
    For a given unit price y \in \mathbb{R}_+ , the profit of selling x units is

    P(x,y) = yx - f(x) .

    Thus f^*(y) is just the optimal profit for selling at price y .
    N.B.: f convex implies P(\cdot,y) is concave for each y .
    Thus, P(x,y_0) is maximal at x_0 satisfying P'(x_0,y_0) = 0 , i.e., when y_0 = f'(x_0) .
    Viz.: the x_0 where f has slope y_0 .
    The tangent line through (x_0,f(x_0)) is then given by y=y_0 (x - x_0) + f(x_0) .
    Lastly, note that the y-intercept of this line is -y_0x_0 + f(x_0)=-f^*(y_0) .


    Remarks
    1. Often f^* is just called the conjugate function of f .
    2. Since f^* is the supremum of a family of affine functions, f^* is always convex, even if f is not.
      (Follows from Convex Function Theory.Convex Calculus.)
    3. If
      • f is convex
      • \text{epi}\,f is a closed subset of \mathbb{R}^n \times \mathbb{R} ,
      then f^{**} = f .


    Example. We will compute the conjugate function of

    \begin{aligned} f(x) &= e^x\\ \text{dom}\,f &= \mathbb{R}. \end{aligned}


    Thus, let

    \begin{aligned} h_y(x) &= yx - f(x) \\ &= yx - e^x . \end{aligned}

    Case y<0 : h_y(x) is unbounded above on \text{dom}\,f = \mathbb{R} since

    h_y(x) \to +\infty as x \to -\infty .

    Thus

    \sup\{yx-e^x: x \in \text{dom}\,f\} = +\infty .



    Case y>0 : Compute

    \begin{aligned} h_y'(x) &= y - e^x =0 \text{ when } x=\log y \\  h_y''(x)&=-e^x \leq 0. \end{aligned}

    Thus x=\log y maximizes h_y and so

    \begin{aligned} \sup\{yx-e^x:x \in \text{dom}\,f\} &= \max\{yx-e^x:x \in \text{dom}\,f \}\\ &= h_y(\log y) \\ &= y\log y - y  \end{aligned}



    Case y=0 : Compute

    h_0(x) = -e^x,

    which evidently has least upper bound 0 and so

    \sup\{ -e^x : x \in \text{dom}\,f \} = 0.



    Conclusion: Since

    yx-e^x

    is bounded above on \text{dom}\,f only for y \geq 0 , it follows that

    \text{dom}\, f^* = \mathbb{R}_{+} .

    Putting everything together:

    f^*(y) = \sup\{yx-e^x\} = y\log y - y for y \in \mathbb{R}_+ ,

    where we take 0 \log 0 = 0 .
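    N.B.: a minimal numerical check of this conclusion, approximating the supremum over a grid (assuming NumPy):

    import numpy as np

    # Compare a grid approximation of sup_x (yx - e^x) with y log y - y.
    xs = np.linspace(-20.0, 10.0, 300_001)
    for y in [0.5, 1.0, 2.0, 5.0]:
        approx = np.max(y * xs - np.exp(xs))
        exact = y * np.log(y) - y
        print(y, approx, exact)  # the last two columns should nearly agree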















    Legendre Transform Let f:\mathbb{R}^n \to \mathbb{R} be convex, differentiable and with \text{dom}\,f = \mathbb{R}^n .
    Then, the Fenchel conjugate f^* of f is often called the Legendre transform of f .


    Proposition. If f is as above, z \in \mathbb{R}^n and y = \nabla f(z), then

    f^*(y) = z^T \nabla f(z) - f(z) .

    Proof.
    Step 1. Let

    h_y(x) = y^T x - f(x) .

    and note

    x^\star \in \mathbb{R}^n maximizes h_y iff \nabla h_y(x^\star) = 0

    since h_y is a sum of concave functions and hence concave.
    Step 2. Using Step 1. and

    \begin{aligned}  \nabla(y^T x) &= y\\ \nabla h_y(x) &= \nabla(y^Tx-f(x)) = y - \nabla f(x). \end{aligned}

    conclude

    y = \nabla f(x^\star) iff x^\star maximizes h_y .

    (In particular, z \in \mathbb{R}^n maximizes h_{\nabla f(z)} .)
    Step 3. Letting z, y \in \mathbb{R}^n satisfy

    \begin{aligned} y &= \nabla f(z) \end{aligned}

    and using Steps 1. and 2., we conclude

    \begin{aligned}  f^*(y) &= \sup \{ y^T x - f(x) : x \in \text{dom}\, f\}\\ &= \max \{ h_y(x) : x \in \text{dom}\,f \}\\ & = z^T \nabla f(z) - f(z)  \end{aligned} ,

    as desired.


    Example 1. Let

    f(x) = e^x ,

    and compute

    f'(x) = e^x .

    Given z \in \mathbb{R} , let

    y = f'(z) = e^z ; i.e., z = \log y .

    Thus

    f^*(y) = z f'(z) - f(z) = y \log y - y ,

    which agrees with our calculation for f^* in a previous example.


    Example 2. Fix Q \in \boldsymbol{S}_{++}^n and let

    \begin{aligned} f(x) &= \frac{1}{2} x^T Q x\\ \text{dom}\,f&= \mathbb{R}^n. \end{aligned}

    We will compute

    f^*(y) = \frac{1}{2}y^T Q^{-1}y .

    Step 0. Observe
    1. f is convex:

      \begin{aligned} \nabla^2 f(x) = Q \succ 0. \end{aligned}

      (Justification) Consider case n = 2 .
      Let Q = \begin{bmatrix}a&b\\b&d\end{bmatrix}.
      Thus f(x_1,x_2) = \frac{1}{2}(ax_1^2 + 2 b x_1 x_2 + d x_2^2) .
      Easy now to see \nabla^2 f = Q .


    2. Q \succ 0 implies Q is invertible since then \det Q > 0 .




    Step 1. Using

    \begin{aligned} \nabla f(x) &=\nabla(\frac{1}{2}x^TQx) \\ &= Qx, \end{aligned}

    we conclude

    y = \nabla f(z) \iff y = Qz \iff z = Q^{-1}y.





    Step 2. Let y = \nabla f(z) .
    By preceding proposition and Step 1., there holds

    \begin{aligned}  f^*(y) &= z^T \nabla f(z) - f(z)\\ &= (Q^{-1}y)^T y - \frac{1}{2}(Q^{-1}y)^TQ(Q^{-1}y)\\ &= y^T Q^{-1}y - \frac{1}{2} y^T Q^{-1} y\\ &= \frac{1}{2}y^T Q^{-1} y. \end{aligned}
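    N.B.: this closed form is easy to validate by maximizing h_y directly; a sketch with a randomly generated positive definite Q (assuming NumPy/SciPy):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    Q = A @ A.T + 4.0 * np.eye(4)   # random symmetric positive definite matrix
    y = rng.standard_normal(4)

    closed_form = 0.5 * y @ np.linalg.solve(Q, y)   # (1/2) y^T Q^{-1} y

    # Maximize h_y(x) = y^T x - (1/2) x^T Q x by minimizing its negative.
    res = minimize(lambda x: -(y @ x - 0.5 * x @ Q @ x), x0=np.zeros(4))
    print(closed_form, -res.fun)    # should agree up to solver tolerance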
















    Other Notions of Convexity There are two other important notions of convexity that we will return to if needed.
    Let f:\mathbb{R}^n \to \mathbb{R} be given.

    Quasiconvexity: \text{dom}\,f and the sublevel sets

    \{x \in \text{dom}\,f: f(x) \leq \alpha \}

    are convex for all \alpha \in \mathbb{R}.
    Features:
    1. Quasiconvex problems may sometimes be suitably approximated by convex problems.
    2. Local minima need not be global minima.


    Log-convexity: f>0 on \text{dom}\, f and \log f is convex; equivalently

    f(tx+(1-t)y) \leq f(x)^tf(y)^{1-t} for all t \in [0,1].
















    Generalized Convexity Let K \subset \mathbb{R}^m be a proper cone and let f: \mathbb{R}^n \to \mathbb{R}^m .

    K -convexity: for all x,y \in \mathbb{R}^n and t \in [0,1] , there holds

    f(tx + (1-t)y) \preceq_K t f(x) + (1-t)f(y) .

    Strict K -convexity: for all x \neq y \in \mathbb{R}^n and t \in (0,1) , there holds

    f(tx + (1-t)y) \prec_K t f(x) + (1-t)f(y) .

    Examples
    1. (CO Example 3.47)
      Let K = \mathbb{R}_+^n .
      Then f: \mathbb{R}^n \to \mathbb{R}^m is K-convex iff: \text{dom}\, f is convex and for all x,y \in \text{dom}f and t \in[0,1], there holds

      f(tx+(1-t)y) \preceq tf(x) + (1-t)f(y)

      which holds iff

      f_i(tx+(1-t)y) \leq tf_i(x) + (1-t)f_i(y)

      for each i = 1,\ldots, m , i.e., iff f is componentwise convex.
    2. (CO Example 3.48)
      A function f: \mathbb{R}^n \to \boldsymbol{S}^m is \boldsymbol{S}_+^m-convex iff : \text{dom}\,f is convex and for all x,y \in \text{dom}\,f and t \in [0,1] , there holds

      f(tx + (1-t)y) \preceq t f(x) + (1-t)f(y) .

      N.B.:
      • this is a matrix inequality and \boldsymbol{S}_+^n -convexity is often called matrix convexity.
      • f is matrix convex iff z^Tf(x)z is convex for all z \in \mathbb{R}^m .
      • The two functions

        \begin{aligned} \mathbb{R}^{n \times m} \ni X &\mapsto XX^T\\ \boldsymbol{S}_{++}^n \ni X &\mapsto X^p, \quad 1 \leq p \leq 2, -1 \leq p \leq 0. \end{aligned}

        are matrix convex.
    Basics of Optimization Problems
    General Optimization Problems By an optimization problem (OP) we mean the following:

    \text{(OP)} \begin{cases} \text{minimize } & f_0(x) \quad \text{(objective)}\\ \text{subject to }& f_i(x) \leq 0, \quad i=1,\ldots,m \quad \text{(inequality constraints)}\\ & h_i(x) = 0, \quad i=1,\ldots,p \quad \text{(equality constraints)}  \end{cases}.

    We call

    f_0:\mathbb{R}^n \to \mathbb{R} the objective function;
    x \in \mathbb{R}^n the optimization variable or parameters;
    f_i:\mathbb{R}^n \to \mathbb{R}, i=1,\ldots,m, the inequality constraint functions; and
    h_i:\mathbb{R}^n \to \mathbb{R}, i=1,\ldots,p, the equality constraint functions.

    The domain of (OP) is the intersection

    D = \bigcap_{i=0}^{m} \text{dom} \, f_i \cap \bigcap_{i=1}^{p} \text{dom} \, h_i.
















    Feasibility Consider an (OP) as above.
    Feasible point: those x \in D satisfying

    \begin{aligned} f_i(x)&\leq 0\quad\text{for } i =1,\ldots,m\\  h_i(x) &= 0 \text{ for }i=1,\ldots,p. \end{aligned}

    Feasible set: the subset F \subset D consisting of the feasible points.
    Feasible problem: A problem with nonempty feasible set, i.e., F \neq \emptyset.
    Infeasible problem: A problem with empty feasible set; i.e., there are no x \in D which satisfy the inequality and equality constraints.

    Remark.
    1. A feasible problem need not have a solution; e.g., f(x) = e^x has no minimizer (and hence no minimum) on \mathbb{R} .
    2. An infeasible problem never has a solution–there are no parameters x to even test.















    Basic Example Consider the problem

    \begin{cases} \text{minimize } & \log(1-x^2-y^2)\\  \text{subject to }& (x-1)^2+(y-1)^2-1\leq0\\ & (x-y-1)^2+(y-1)^2-1\leq0 \end{cases}.

    The objective function is

    \begin{aligned}  f_0(x,y) &= \log(1-x^2-y^2)\\ \text{dom}\,f_0 &= \{ (x,y) \in \mathbb{R}^2 : x^2+y^2<1\} \end{aligned} ,

    The inequality constraint functions are

    \begin{aligned} f_1(x,y) &=(x-1)^2 + (y-1)^2-1\\ f_2(x,y) &=(x-y-1)^2 + (y-1)^2 - 1\\ \text{dom}\,f_1 &= \text{dom}\,f_2 = \mathbb{R}^2. \end{aligned} .

    The domain of the problem is

    D = \text{dom}\, f_0 \cap \text{dom}\, f_1 \cap \text{dom}\, f_2 = \{ x^2+y^2<1\} .

    The feasible set: Let

    \begin{aligned} A &= \text{dom}\,f_0\\ B &= \{(x-1)^2+(y-1)^2-1\leq0\}\\ C &= \{(x-y-1)^2 +(y-1)^2-1\leq0\} \end{aligned} .

    These three sets are depicted in the image below.
    Note that the darkest region given by A \cap B \cap C is the feasible set.
    Can we solve the problem? Noting
    1. \log(1-x^2-y^2) \to -\infty as (x,y) approaches a point on the circle \{ x^2+y^2 = 1 \} , and
    2. such sequences exist in the feasible set,
    we conclude the problem does not have a solution.















    The Feasibility Problem Feasibility problem: Given an (OP) with

    \begin{aligned} &\text{inequality constraint functions } f_i, i = 1,\ldots,m\\ &\text{equality constraint functions } h_i, i = 1, \ldots,p \end{aligned}

    solve

    \begin{cases} \text{find} & x\\ \text{subject to} & f_i(x) \leq 0, i = 1, \ldots, m\\ & h_i(x) = 0, i = 1,\ldots, p \end{cases}.

    Viz.: the feasibility problem determines whether the constraints are consistent.

    Example 1. The problem

    \begin{cases} \text{find} & (x,y)\\ \text{subject to} &f_1(x,y) = x^2 + y^2 - 1 \leq 0\\ &f_2(x,y) = (x-1)^2 + y^2 - 1 \leq 0 \end{cases}

    has a solution since the two inequality constraints describe two intersecting disks.
    This is depicted below.


    Example 2. The problem

    \begin{cases} \text{find} & (x,y)\\ \text{subject to} &f_1(x,y) = x^2 + y^2 - 1 \leq 0\\ &f_2(x,y) = (x-1)^2 + y^2 - 1 \leq 0\\ &h_1(x,y) = (x-\frac{1}{2})^2 + y^2 - 1 =0 \end{cases}

    has no solution since the circle given by h_1=0 lies outside of the intersection of the two disks.
    This is depicted below, where the red circle is given by h_1=0 .
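    N.B.: for low-dimensional problems like these, feasibility can be probed crudely on a grid; a sketch for the two examples (the equality constraint is thickened to |h_1| \leq 10^{-2} so it can register on a finite grid), assuming NumPy:

    import numpy as np

    # Evaluate the constraints of Examples 1 and 2 on a grid over [-2, 2]^2.
    x = np.linspace(-2.0, 2.0, 801)
    X, Y = np.meshgrid(x, x)
    f1 = X**2 + Y**2 - 1 <= 0
    f2 = (X - 1)**2 + Y**2 - 1 <= 0
    h1 = np.abs((X - 0.5)**2 + Y**2 - 1) <= 1e-2  # thickened equality constraint
    print(bool(np.any(f1 & f2)))       # True: Example 1 is feasible
    print(bool(np.any(f1 & f2 & h1)))  # False: Example 2 is infeasible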















    Optimal Value and Solvability Recall:

    \begin{aligned} F &= \text{ feasible set of problem}\\ D &= \text{ domain of problem }. \end{aligned}

    Optimal value: The value

    p^\star = \inf \{ f_0(x) : x \in F \},

    i.e., p^\star is the largest p \in \mathbb{R} such that p\leq f_0(x) for all x \in F .
    N.B.: p^\star \in \mathbb{R} \cup \{ -\infty, + \infty \} .

    Example: Below depicts the graph of f(x) = \frac{1}{x} on \mathbb{R}_{++} .
    Evidently, \inf \{ \frac{1}{x} : x \in \mathbb{R}_{++} \} = 0 .




    Solvable: When the problem satisfies

    there exists x^\star \in F with f_0(x^\star) = p^\star,

    i.e., the minimum value p^\star is attainable.

    Example: Below depicts the graph of a quartic q(x) .
    The problem of minimizing q(x) on \mathbb{R} is solvable with solution given by the minimal point A .
    N.B.: Point B is only a local minimum, not a global one, and hence does not give a solution.






    Remarks.
    1. p^\star = \min\{f_0(x):x \in F \} iff the (OP) is solvable.
      Indeed, \min\{f_0(x):x \in F \} is not well-defined unless the (OP) is solvable.
    2. p^\star need not be finite:
      p^\star = -\infty if f_0 is unbounded below on the feasible set; and
      p^\star = + \infty if the OP is infeasible.



















    Standard Form Optimization problems need not be placed in the form we defined them.
    We therefore introduce the following definition.

    (OP) in Standard form:

    \text{(OP)} \begin{cases} \text{minimize } & f_0(x) \\ \text{subject to }& f_i(x) \leq 0, \quad i=1,\ldots,m \\ & h_i(x) = 0, \quad i=1,\ldots,p  \end{cases}.

    (This is how we defined (OP) before.)




    Example: Rewriting in standard form. We can recast more general optimization problems in standard form; e.g., consider

    \text{(OP2)} \begin{cases} \text{maximize } & F_0(x) \\ \text{subject to }& F_i(x) \leq G_i(x), \quad i=1,\ldots,m \\ & H_i(x) = K_i(x), \quad i=1,\ldots,p \end{cases}.

    Indeed, taking
    f_0 = -F_0 (noting \min f_0 = -\max F_0)
    f_i = F_i - G_i for i=1,\ldots,m
    h_i = H_i - K_i for i=1,\ldots,p
    we readily recast (OP2) into the standard form (OP).















    Equivalent Problems Suppose we are given two OP’s: (OP1) and (OP2).
    We say (OP1) and (OP2) are equivalent if: solving (OP1) allows one to solve (OP2), and vice versa.
    N.B.: Two problems being equivalent does not mean the problems are the same nor that they have the same solutions.

    Example. Consider the two problems:

    \begin{cases} \text{minimize} & f(x) = x^2\\ \text{subject to}& x \in [1,2] \end{cases}  \quad \text{and} \quad  \begin{cases} \text{minimize}&g(x) = (x+1)^2+1 \\ \text{subject to}&x \in [0,1] \end{cases}.

    Observing

    x^\star \in [1,2] minimizes f(x) on [1,2]

    iff

    x^\star - 1 \in [0,1] minimizes g(x) on [0,1] ,

    we readily see the two problems are equivalent.
    Indeed, if we find the solution x^\star = 1 to the first problem, we readily obtain the solution x^\star - 1 = 0 to the second problem, and vice versa.















    Change of Variables Suppose \phi:\mathbb{R}^n \to \mathbb{R}^n is an injective function with D \subset \phi(\text{dom} \,\phi).
    Then, under the change of variable x \mapsto \phi(x) , we have

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}

    is equivalent to

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(\phi(x))\\ \text{subject to}& f_i(\phi(x)) \leq 0, i=1,\ldots,m\\ & h_i(\phi(x)) = 0, i=1,\ldots,p \end{cases}.

    N.B.: such a change of variables does not change the optimal value p^\star .
    Moreover, injectivity may be dropped.

    Justification. Indeed,
    • if x solves (OP1), then \phi^{-1}(x) solves (OP2).
      (More generally, z such that \phi(z) = x solves (OP2).)
    • if z solves (OP2), then \phi(z) solves (OP1).




    Example Consider the problem

    \begin{cases} \text{minimize} & e^x\\ \text{subject to} & \sqrt{x} - y \leq 0\\ &y-5 \leq 0\\ &x-5\leq 0 \end{cases} .

    In the image below, the shaded region is the feasible set and the curve is the graph of f_0(x) = e^x .
    Consider the change of variables

    \phi(x) = x^2 .

    The objective and constraints change as follows:

    \begin{aligned} f_0(x) = e^x & \to f_0(\phi(x)) = e^{x^2}\\ \sqrt{x}-y \leq 0 & \to |x| - y \leq 0\\ y-5 \leq 0 & \to y - 5 \leq 0\\ x - 5 \leq 0 & \to x^2 - 5 \leq 0 \end{aligned}

    The new feasible region and objective function are plotted below.
    Evidently, this change of variable changed a nonconvex (OP) into a convex one.















    Eliminating Linear Constraints Let A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m and x_0 \in \mathbb{R}^n a solution to Ax=b.
    Let B \in \mathbb{R}^{n\times k} be such that \text{range}\, B = \text{kernel}\, A.
    Then Ax=b iff x=By + x_0 for some y \in \mathbb{R}^{k}.

    Consequently

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & Ax = b \end{cases}

    is equivalent to

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(By+x_0)\\ \text{subject to}& f_i(By+x_0) \leq 0, i=1,\ldots,m\\ \end{cases}.

    N.B.: this can reduce dimension of problem by \text{rank}\, A many variables. (Recall: n = \text{rank}\,A + \text{null}\,A.)

    Justification. Indeed,
    • if x solves (OP1), then any y \in \mathbb{R}^k with x = By+x_0 solves (OP2), and
    • if y solves (OP2), then x = By + x_0 solves (OP1)


    Example. Consider the minimization problem

    \begin{cases} \text{minimize} & x^2 + y^2 \\ \text{subject to} &x\geq0\\ & y-x=1 \end{cases}.

    We may eliminate the variable y by simply using y=x+1 .

    But, to match with above: let

    \begin{aligned} f_0(x,y)=x^2 +y^2, &\quad A = \begin{bmatrix}-1&1\end{bmatrix}, \quad b = 1 \\ x_0 = \begin{bmatrix}0\\1\end{bmatrix},&\quad B = \begin{bmatrix} 1\\1\end{bmatrix} \end{aligned} .

    Thus

    A \begin{bmatrix}x\\y\end{bmatrix}=b \iff \begin{bmatrix}x\\y \end{bmatrix} = Bt + x_0 = \begin{bmatrix}t\\t+1\end{bmatrix}

    for some t \in \mathbb{R} , and so

    f_0(Bt + x_0) = f_0(t,t+1) = t^2 + (1+t)^2 .



    Therefore, the minimization problem becomes

    \begin{cases} \text{minimize} & t^2 + (t+1)^2 \\ \text{subject to} &t\geq0\\ \end{cases},

    which has the obvious solution t^\star = 0 with optimal value p^\star = 1 .
    Thus, the original problem has solution

    \begin{bmatrix}x\\y\end{bmatrix} = \begin{bmatrix}t\\t+1\end{bmatrix} = \begin{bmatrix}0\\1\end{bmatrix} .
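
    This elimination is easy to check numerically. Below is a minimal sketch (assuming NumPy/SciPy), where scipy.linalg.null_space builds a matrix B with \text{range}\,B = \text{kernel}\,A (a rescaled version of the B chosen above) and the reduced one-variable problem is solved directly:

    ```python
    import numpy as np
    from scipy.linalg import null_space
    from scipy.optimize import minimize_scalar

    A = np.array([[-1.0, 1.0]]); b = np.array([1.0])
    x0 = np.array([0.0, 1.0])        # a particular solution of Ax = b
    B = null_space(A)                # columns span kernel(A)
    B = B * np.sign(B[0, 0])         # fix the sign so that B ~ [1, 1]/sqrt(2)

    f0 = lambda t: np.sum((B @ [t] + x0) ** 2)   # f0(Bt + x0) = x^2 + y^2
    # With this sign convention, the constraint x >= 0 again reads t >= 0.
    res = minimize_scalar(f0, bounds=(0.0, 10.0), method="bounded")
    print(res.x, B @ [res.x] + x0)   # t* ~ 0 and (x, y) ~ (0, 1)
    ```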
















    Slack Variables Given f:\mathbb{R}^n \to \mathbb{R} with f(x) \leq 0 , there is a variable s \geq 0 (namely, s = -f(x) ) such that f(x) + s = 0 ; such a variable s is called a slack variable.

    Using slack variables s_i,i=1\ldots,m , the problem

    \text{(OP1)}\begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i = 1,\ldots,p \end{cases}.

    is equivalent to the problem

    \text{(OP2)}\begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & s_i \geq 0, i = 1,\ldots,m\\ & f_i(x)+s_i = 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}.



    Remarks.
    1. The x which satisfy the constraints of (OP2) for some s_1,\ldots,s_m \geq 0 are exactly those which satisfy the constraints of (OP1); this justifies the equivalence.
    2. Let F_1 be the feasible set of (OP1) and F_2 that of (OP2).
      Then F_1 \subset \mathbb{R}^n and F_2 \subset \mathbb{R}^{n+m} ; i.e., the feasible sets are not the same object.
    3. Example: in the images below, the disk depicts a feasible set F_1 = \{ x^2+y^2-1 \leq 0 \} \subset \mathbb{R}^2 and the paraboloid-type set depicts the feasible set F_2 = \{x^2+y^2-1+s =0, s \geq 0 \} with slack variable s .
      N.B.: the permissible (x,y) coordinates are the same for both sets.


    Main point: Solving the system of equations

    \begin{aligned} f_i(x)+s_i &= 0, i=1,\ldots,m\\  h_i(x) &= 0, i=1,\ldots,p \end{aligned}

    and considering only those solutions with s_i \geq 0 may be easier than solving the system of inequalities

    \begin{aligned} f_i(x)&\leq0, i=1,\ldots,m\\  h_i(x) &= 0, i=1,\ldots,p \end{aligned}.



    Example. Consider

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x,y)\\ \text{subject to} & a_1 x + b_1 y - c_1 \leq 0\\ & a_2 x + b_2 y - c_2 = 0 \end{cases}.

    Introduce slack variable s\geq0 satisfying

    a_1 x + b_1 y - c_1 +s = 0.

    Then (OP1) is equivalent to the problem

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(x,y)\\ \text{subject to} & s\geq0\\ & a_1 x + b_1 y - c_1 + s = 0\\ & a_2 x + b_2 y - c_2 = 0 \end{cases}.

    Thus, finding feasible (x,y,s) is just a matter of solving a system of equations and choosing those (x,y,s) with s\geq0 .

    Moreover, one can solve the problem

    \text{(OP3)} \begin{cases} \text{minimize} & f_0(x,y)\\ \text{subject to} & a_1 x + b_1 y - c_1 + s = 0\\ & a_2 x + b_2 y - c_2 = 0 \end{cases}.

    and just choose solutions with s \geq 0 (when such solutions exist) to obtain solutions to (OP2), and hence (OP1).















    Epigraph Form Recall: \text{epi}\, f = \{(x,t): x \in \text{dom}\,f, t \geq f(x) \}.

    The optimization problem

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}

    is equivalent to its epigraph form

    \text{(OP2)} \begin{cases} \text{minimize} & t\\ \text{subject to} &f_0(x) - t \leq0\\ & f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}

    Viz., minimizing f_0 subject to constraints is equivalent to finding the smallest t such that (x,t) \in \text{epi}\, f_0 for some feasible x .

    Proof by picture. The dark curve and shaded region below indicate the epigraph of a function f .
    The red dot indicates the minimum point (x^\star,p^\star) .
    The black dots indicate points (x^\star,t) \in \text{epi}\,f for different values of t .
    Evidently, the smallest t^\star for which (x^\star,t^\star) \in \text{epi}\, f is given by t^\star = p^\star .















    Fragmenting a Problem Proposition. Given f: \mathbb{R}^n \to \mathbb{R} and sets F,F_1,\ldots,F_q with F = F_1 \cup \cdots \cup F_q, let

    \begin{aligned} p^\star &= \inf \{f(x):x \in F\}\\ p_i^\star &= \inf\{ f(x): x \in F_i\}, \quad i = 1,\ldots, q. \end{aligned}

    Then

    p^\star = \min\{p_i^\star: i = 1,\ldots,q\}.

    (Here we use the convention \min\{-\infty,a\} = -\infty for any real number a .)


    Viz., to minimize a function f on a set F, one may instead minimize f over the pieces F_i and then take the minimum of the resulting optimal values.


    Example. Consider the (OP)

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & x \in F \end{cases}

    where F \subset \text{dom}\, f_0 and where the feasible set F is depicted below.
    Consider breaking up F into three regions F_1,F_2,F_3 as indicated below.
    Now formulate the (OP)’s

    \text{(OPi)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & x \in F_i \end{cases}

    for i = 1,2,3 , and let p_i^\star be the optimal value for (OPi).
    Using the preceding proposition, the optimal value of (OP) is given by

    p^\star = \min \{ p_1^\star, p_2^\star, p_3^\star \} .

    Conclusion: solving (OP), whose feasible set is not convex, may be achieved by solving three subproblems (OP1),(OP2),(OP3) whose feasible sets are convex.















    Basics of Convex Optimization
    Convex Optimization Problems Abstract convex optimization problem: A problem involving minimizing a convex objective function on a convex set.
    Convex optimization problem: a problem of the form

    \text{(COP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, i = 1,\ldots,m\\ & a_i^T x = b_i, i =1,\ldots, p \end{cases},

    where
    f_0:\mathbb{R}^n \to \mathbb{R} and f_i:\mathbb{R}^n \to \mathbb{R} are convex; and
    a_i \in \mathbb{R}^n and b_i \in \mathbb{R} are fixed.

















    Some Remarks
    Remark 1. As defined, a (COP) is an (OP) in standard form; naturally, there are nonstandard form (OP)’s equivalent to (COP)’s.
    E.g., the abstract (COP)

    \begin{cases} \text{minimize} & f_0(x,y) \\ \text{subject to} & (x+y+1)^2 = 0 \end{cases}

    is readily seen to be equivalent to the standard form (COP)

    \begin{cases} \text{minimize} & f_0(x,y) \\ \text{subject to}& \begin{bmatrix}1&1\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix} + 1 = 0 \end{cases} .







    Remark 2. We emphasize: the equality constraints are assumed to be affine constraints.
    Moreover, the equality constraints

    a_i^T x = b_i , \quad i=1,\ldots,p

    can be rewritten as

    A x = b ,

    where

    \begin{aligned} A &= \begin{bmatrix} a_1^T\\a_2^T\\\vdots\\a_p^T\end{bmatrix} \in \mathbb{R}^{p \times n}, \quad b = \begin{bmatrix}b_1\\b_2\\\vdots\\b_p\end{bmatrix} \in \mathbb{R}^{p} \end{aligned} .







    Remark 3. The affine assumption on the equality constraints can be lifted at the possible expense of an intractable theory/numerical analysis.
    E.g., if h: \mathbb{R}^n \to \mathbb{R} is quasilinear, then h(x) = 0 defines a convex set.





    Remark 4. Generally, h(x) convex does not imply the level set \{h(x) = 0\} is convex; e.g., h(x,y)=x^2+y^2-1 gives a circle, which is not convex.





    Remark 5. The common domain

    D = \bigcap_{i=0}^m \text{dom}\, f_i

    is convex since it is an intersection of convex sets.















    Optimality for Convex Optimization Problems Assume throughout that f_0:\mathbb{R}^n \to \mathbb{R} is the objective function for some given (COP) and that F is the feasible set.

    Proposition 1. If x^\star is a feasible local minimizer for a (COP), then it is the global minimizer for the (COP).
    Proof. We will follow a proof by contradiction; i.e., we will show that assuming x^\star is not a global minimizer leads to a contradiction.
    Step 1. x^\star being a feasible local minimizer means x^\star \in F and that there is an R>0 such that

    f_0(x^\star) = \inf\{ f_0(z) : z \in F, \quad \Vert z-x^\star \Vert_2 \leq R \} ;

    i.e., f_0(x^\star) \leq f_0(z) for all z \in F within distance R of x^\star .



    Step 2. Supposing x^\star is not a global minimizer, then there exists y \in F such that f_0(y) < f_0(x^\star) .
    By choice of R , there must also hold \Vert{y-x^\star}\Vert_2 > R .



    Step 3. Set

    \begin{aligned} z &= (1-t)x^\star + ty\\ t &= \frac{R}{2\Vert y-x^\star \Vert_2}, \end{aligned}

    noting that t \in (0,\tfrac{1}{2}) \subset [0,1] by Step 2., and so z \in F since F is convex.
    It follows that

    \begin{aligned} \Vert z - x^\star \Vert_2 &= \Vert (1-t)x^\star + ty - x^\star \Vert_2\\ &= \Vert t(y-x^\star ) \Vert_2\\ &= \frac{R}{2\Vert y-x^\star\Vert_2} \Vert y-x^\star\Vert_2\\ &= \frac{R}{2}\\ &< R \end{aligned}





    Step 4. Since z is a convex combination of feasible points, since f_0 is convex and since f_0(y)<f_0(x^\star) , there holds

    \begin{aligned}  f_0(z) &= f_0((1-t)x^\star + ty) \\ &\leq (1-t)f_0(x^\star) + t f_0(y)\\ & < (1-t) f_0(x^\star) + t f_0(x^\star)\\ &= f_0(x^\star). \end{aligned}

    But, since x^\star minimizes f_0 on

    \{x \in F : \Vert x - x^\star \Vert_2 \leq R \}

    and since

    \Vert z - x^\star \Vert_2 \leq R

    we also have

    f_0(x^\star) \leq f_0(z).

    This is a contradiction and so x^\star must be a global minimizer.








    Proposition 2. If f_0 is differentiable on F , then x^\star \in F is a minimizer iff for all y \in F there holds

    \nabla f_0(x^\star)^T (y - x^\star) \geq 0 .

    Proof.
    Step 0. N.B.: since f_0 is differentiable and convex on F , then for each x,y \in F there holds

    f_0(y) \geq f_0(x) + \nabla f_0(x)^T (y-x) .

    (C.f., Convex Function Theory.First Order Characterization.)



    Step 1.(\implies ) Suppose x^\star is a minimizer and suppose for contradiction that

    \nabla f_0(x^\star)^T(y-x^\star)<0

    for some y \in F .
    Set z_t = ty+(1-t)x^\star , noting that z_t \in F since F is convex.
    Using

    \frac{d}{dt} f_0(z_t)|_{t=0} = \nabla f_0(x^\star)^T (y-x^\star) < 0 ,

    we conclude f_0 is decreasing near z_0 = x^\star in the direction y-x^\star and so f_0(z_t)< f_0(x^\star) for small t .
    Since z_t \in F , this contradicts x^\star being a minimizer.
    Additional justification Since z_t defines a line passing through x^\star with direction y-x^\star , it follows that \frac{d}{dt} f_0(z_t)|_{t=0} is the directional derivative in direction (y-x^\star), i.e., \nabla f_0(x^\star)^T(y-x^\star).




    Step 2.(\impliedby ) Supposing

    \nabla f_0(x^\star)^T(y-x^\star) \geq 0

    for all y \in F and using the first order characterization at x^\star , namely,

    f_0(y) \geq f_0(x^\star) + \nabla f_0(x^\star)^T(y-x^\star)

    we readily conclude

    f_0(y) \geq f_0(x^\star)

    for all y \in F ; i.e., that x^\star is a minimizer for the problem.








    Corollary In case f_0 is differentiable and F = \text{dom}\,f_0 (equivalently, there are no nontrivial constraints), x^\star \in \text{dom}\,f_0 is a minimizer iff

    \nabla f_0(x^\star) = 0 .

    Proof. By Proposition 2., we have that x^\star \in \text{dom}\,f_0 is a minimizer iff

    \nabla f_0(x^\star)^T(y-x^\star) \geq 0

    for all y \in \text{dom}\, f_0 .
    Differentiability of f_0 requires \text{dom}\,f_0 is open and so, for small t \in \mathbb{R} , there holds

    y: = x^\star - t \nabla f_0(x^\star) \in \text{dom}\,f_0 .

    But then

    \begin{aligned} \nabla f_0(x^\star)^T(y-x^\star) &= \nabla f_0(x^\star)^T(x^\star - t \nabla f_0(x^\star) - x^\star)\\ &= -t\nabla f_0(x^\star)^T\nabla f_0(x^\star)\\ &= -t \Vert \nabla f_0(x^\star) \Vert_2^2\\ &\geq 0, \end{aligned}

    which, for t>0 , forces \nabla f_0(x^\star) = 0 . (Conversely, \nabla f_0(x^\star) = 0 makes the optimality condition of Proposition 2. hold trivially.)















    Some Examples
    Example 1. Let

    \begin{aligned} Q \in &\boldsymbol{S}_+^n, \quad a \in \mathbb{R}^n, \quad b \in \mathbb{R}\\ f_0(x)&= \frac{1}{2} x^T Q x + a^T x + b. \end{aligned}

    Consider the unconstrained problem:

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x) = \frac{1}{2} x^T Q x + a^T x + b\\ \text{subject to }& x \in \mathbb{R}^n \end{cases} .

    Note that \nabla^2 f_0 = Q \succeq 0 implies f_0 is convex.
    C.f.,Convex Function Theory.Second Order Characterization.


    By the preceding corollary, we have x^\star is a solution to (OP) iff

    \nabla f_0(x^\star) = Qx^\star + a = 0.

    Thus solvability of (OP) rests on whether -a \in \text{range}\, Q.

    Three cases:
    1. -a \notin \text{range}\,Q \implies f_0 is unbounded below and hence (OP) is unsolvable;
    2. Q \succ 0 \implies Q is invertible and so x^\star = -Q^{-1}a is the unique solution to (OP);
    3. Q \not\succ 0 and -a \in \text{range}\,Q \implies Qx^\star = -a has an affine set of solutions.
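
    A minimal numerical sketch of cases 1. and 3., assuming NumPy (the matrices below are illustrative, not from the notes; case 2. amounts to np.linalg.solve):

    ```python
    import numpy as np

    # Q >= 0 but singular: range(Q) is the x1-axis.
    Q = np.array([[2.0, 0.0],
                  [0.0, 0.0]])
    a_cases = {"-a not in range(Q)": np.array([0.0, 1.0]),
               "-a in range(Q)":     np.array([-2.0, 0.0])}

    for label, a in a_cases.items():
        x, *_ = np.linalg.lstsq(Q, -a, rcond=None)   # least-squares solve of Qx = -a
        consistent = np.allclose(Q @ x, -a)          # is Qx = -a actually solvable?
        print(f"{label}: solvable = {consistent}, least-squares x = {x}")
    ```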






    Example 2. Let

    \begin{aligned} f_0:\mathbb{R}^n \to \mathbb{R}&  \text{ be convex and differentiable}\\ A &\in \mathbb{R}^{m \times n}, \quad b \in \mathbb{R}^{m} \end{aligned}

    and consider the problem

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & Ax = b \end{cases} .

    Using a preceding proposition, x^\star satisfying Ax^\star = b is a minimizer iff

    \nabla f_0(x^\star)^T(y-x^\star) \geq0

    for all y satisfying Ay= b.


    Two cases
    1. Ax=b is an inconsistent system \implies the problem is infeasible.
    2. Ax=b is a consistent system; then

      Ax^\star = b, Ay=b

      \iff

      y = x^\star + x for some x \in \text{null}A .



    In case 2., we have

    \nabla f_0(x^\star)^T(y-x^\star) = \nabla f_0(x^\star)^Tx \geq 0

    for all

    x \in \text{null}A and y = x^\star + x .

    Since \text{null}\,A is a linear space (so x \in \text{null}\,A \implies -x \in \text{null}\,A ), this is possible iff

    \nabla f_0(x^\star)^Tx = 0 for all x \in \text{null}\,A ,

    i.e., iff

    \nabla f_0(x^\star) \perp \text{null}A.

    But (\text{null}\,A)^\perp = \text{range}\,A^T and so this condition means there exists \nu \in \mathbb{R}^m such that

    \nabla f_0(x^\star) + A^T \nu = 0 .

    This is just a Lagrange multiplier condition, as we will see later.















    Linear Programming Linear program: a (COP) of the form

    \text{(LP)} \begin{cases} \text{minimize} & c^T x \\ \text{subject to} & Gx \preceq h\\ & Ax= b \end{cases} ,

    where

    \begin{aligned} &G \in \mathbb{R}^{m \times n}, \quad A \in \mathbb{R}^{p\times n}\\ &h \in \mathbb{R}^m, \quad b \in \mathbb{R}^p, \quad x \in \mathbb{R}^n. \end{aligned}

    The feasible set F is a polyhedron (see below).

    Recall (\preceq ): For a,b \in \mathbb{R}^n the vector inequality

    a \preceq b

    means

    a_1 \leq b_1, \, a_2 \leq b_2, \, \ldots, \, a_n \leq b_n .



    Different than: A,B \in \mathbb{R}^{n \times n} satisfying the matrix inequality A \preceq B , which means B - A is positive semidefinite.







    Determining the Feasible set:
    Step 1. (A x = b ) Given A \in \mathbb{R}^{p \times n} , b \in \mathbb{R}^p , then

    \{x: Ax=b \}

    is an affine subspace of \mathbb{R}^n or empty.


    Step 2. Given \gamma \in \mathbb{R}^n, \eta \in \mathbb{R}, then

    \{x:\gamma^Tx \leq \eta \}

    is a half space in \mathbb{R}^n .


    Step 3. (G x \preceq h ) Given

    g_i \in \mathbb{R}^n, \quad G = \begin{bmatrix} g_1^T \\ g_2^T \\ \vdots \\ g_m^T \end{bmatrix} \in \mathbb{R}^{m\times n}, \quad Gx = \begin{bmatrix} g_1^Tx\\g_2^Tx\\ \vdots \\ g_m^Tx \end{bmatrix}

    Step 2. implies

    \{x : G x \preceq h \} = \{x : g_i^Tx\leq h_i, i = 1,\ldots,m\}

    is a finite intersection of half spaces.


    Step 4. Steps 1. and 3. imply the feasible set F of (LP) is the finite intersection of half spaces and an affine space, i.e., F is a polyhedron.
    (c.f. Convex Geometry.Polyhedra.)








    Example. Let

    \begin{aligned} c&=\begin{bmatrix}0.1\\1\end{bmatrix},\, G= \begin{bmatrix} 0.8&0.8\\ -1&-1\\ 0&-1\\ 0&1 \end{bmatrix},\, h= \begin{bmatrix} 4\\-3\\-1\\3 \end{bmatrix} \end{aligned}.

    Consider the resulting (LP):

    \text{(LP)} \begin{cases} \text{minimize} & c^T x \\ \text{subject to} & Gx\preceq h \end{cases}.

    Explicitly, this (LP) is given by

    \text{(LP)} \begin{cases} \text{minimize} & 0.1x_1 + x_2\\ \text{subject to} & 0.8x_1 + 0.8x_2 \leq 4\\ & -x_1-x_2 \leq -3\\ &-x_2\leq -1\\ &x_2\leq3 \end{cases}.

    The feasible set F and the graph of the objective function over F are indicated in the image below.
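
    The optimal value can be verified with an off-the-shelf solver. A minimal sketch, assuming SciPy's linprog (which accepts precisely this \min c^Tx subject to Gx \preceq h data):

    ```python
    import numpy as np
    from scipy.optimize import linprog

    c = np.array([0.1, 1.0])
    G = np.array([[0.8, 0.8],
                  [-1.0, -1.0],
                  [0.0, -1.0],
                  [0.0, 1.0]])
    h = np.array([4.0, -3.0, -1.0, 3.0])

    res = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * 2)
    print(res.x, res.fun)   # expect roughly x* = (2, 1) with p* = 1.2
    ```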








    Remarks.
    1. Given d \in \mathbb{R} , have equivalent problem with objective c^Tx + d .
      Indeed,

      \begin{aligned} \min\{c^Tx + d :x \in F\} &= \min\{ c^Tx: x \in F \} + d\\ \text{argmin}\{c^Tx + d :x \in F\} &=\text{argmin}\{ c^Tx: x \in F \}. \end{aligned}

      WLOG: can assume d=0 to solve problem.


    2. Since

      \begin{aligned} \max\{c^Tx :x \in F\} &= -\min\{ -c^Tx: x \in F \}\\ \text{argmax}\{c^Tx :x \in F\} &=\text{argmin}\{ -c^Tx: x \in F \}. \end{aligned}

      one also calls the problem of maximizing c^Tx over a polyhedron a (LP).















    Example: Integer Linear Programming Relaxation. Integer Linear Program: an (OP) of the form

    \text{(ILP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ & x \in \mathbb{Z}^n \end{cases}

    where

    \begin{aligned} &G \in \mathbb{R}^{m \times n},\qquad h \in \mathbb{R}^m\\ \mathbb{Z}^n : &= \{x \in \mathbb{R}^n : x_i \text{ an integer for each } i=1,\ldots,n\} \end{aligned}

    The constraint x \in \mathbb{Z}^n is suitable for parameters which take on discrete quantities.
    N.B.: An (ILP) is not a convex problem, but may be approximated by one (see below).

    The feasible set Let

    F = \{x \in \mathbb{Z}^n: Gx \preceq h \}

    denote the feasible set of an (ILP).
    Then F is just the collection of integer vectors in the polyhedron

    P = \{ x \in \mathbb{R}^n : Gx \preceq h \} .



    Example. Consider an (ILP) with constraints given by (x,y) \in \mathbb{Z}^2 satisfying

    \begin{aligned} &-x\leq0\\ &-y\leq0\\ &y-2x\leq1\\ &y\leq2.5\\ &y+0.9x\leq4 \end{aligned}.

    The image below depicts the feasible set F (the collection of dots) and polyhedron P (shaded region).








    Remarks.
    1. Can of course also impose equality constraints Ax=b :

      \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ & Ax = b\\ & x \in \mathbb{Z}^n\\ \end{cases}

    2. If we impose x_i \in \{0,1\} , then the problem is called a boolean linear program:

      \text{(BLP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ & x_i \in \{0,1\} \end{cases}

      Suitable for when coordinates of x indicate when something is “off” or “on” or decision is “no” or “yes”.
      Can also use x_i \in \{-1,1\} instead.








    Relaxation of (ILP) The LP

    \text{(LP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h \end{cases}

    is called a relaxation of the (ILP) and is a convex approximation.
    Important points:
    1. The “tightest” convex relaxation is given by

      F=\{ x \in \mathbb{Z}^n : Gx \preceq h \} \to \text{convex hull of } F.

      Generally speaking, finding the convex hull is not an efficient way of approaching (ILPs).
    2. The relaxation (LP) of (ILP) is generally easier to solve, though exact algorithms exist for (ILP).
    3. If

      \begin{aligned} p_{LP}^\star &= \text{ optimal value for (LP)}\\ p_{ILP}^\star &= \text{ optimal value for (ILP)} \end{aligned}

      then p_{LP}^\star \leq p_{ILP}^\star.
      (Indeed, the relaxation has a larger feasible set.)
    4. If x^\star \in \mathbb{Z}^n solves the (LP), then it solves the (ILP).
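
    Points 3. and 4. can be illustrated on the example polyhedron above. A minimal sketch assuming SciPy, with an illustrative cost vector c (not from the notes); the (ILP) is brute-forced over a small grid of integer points:

    ```python
    import numpy as np
    from itertools import product
    from scipy.optimize import linprog

    # The polyhedron P from the example above, written as G z <= h for z = (x, y).
    G = np.array([[-1.0, 0.0],
                  [0.0, -1.0],
                  [-2.0, 1.0],
                  [0.0, 1.0],
                  [0.9, 1.0]])
    h = np.array([0.0, 0.0, 1.0, 2.5, 4.0])
    c = np.array([-1.0, -1.0])   # maximize x + y, i.e., minimize -(x + y)

    lp = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * 2)

    # Brute-force the (ILP) over a small grid of integer points in P.
    best = min(c @ np.array(z) for z in product(range(6), repeat=2)
               if np.all(G @ np.array(z) <= h))
    print(lp.fun, best)   # p*_LP <= p*_ILP: roughly -4.167 vs -4.0
    ```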








    Relaxation of (BLP) Explicitly, a (BLP) is of the form

    \text{(BLP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ &x_i \in \{0,1\}, i=1,\ldots,n \end{cases}.

    The (LP)

    \text{(LP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ &0 \leq x_i \leq 1, i=1,\ldots,n \end{cases}

    is called a relaxation of the (BLP).
    N.B.: the relaxation

    F \to \{x \in [0,1]^n: Gx \preceq h \}

    generally provides a better approximation than the relaxation

    F \to \{x \in \mathbb{R}^n: Gx \preceq h \} .

    Indeed, the former biases approximate solutions to be close to being binary.







    Example. Problem Given m workers and n locations with m \leq n ,
    • assign each worker to work at some location
    • assign at most one worker to a location
    • minimize cost of operation and transportation
    Notational set up

    \begin{aligned} c_j &= \text{cost to operate at location }j\\ c_{ij} &= \text{cost to transport worker } i \text{ to location }j\\ x_j &=  \begin{cases} 0 & \text{ if location }j\text{ is not operating}\\ 1 & \text{ if location }j\text{ is operating} \end{cases}\\ x_{ij} &=  \begin{cases} 0 & \text{ if worker }i \text{ is not working at location }j\\ 1 & \text{ if worker }i \text{ is working at location }j \end{cases}. \end{aligned}

    Let

    \begin{aligned} \text{Cost vectors} & \begin{cases} c &= (c_j) \in \mathbb{R}^n\\ C &= (c_{ij}) \in \mathbb{R}^{m \times n} \end{cases}\\ \text{Optimization variables} &\begin{cases} x &= (x_j) \in \{0,1\}^n\\ X &= (x_{ij}) \in \{0,1\}^{m \times n} \end{cases} \end{aligned}.



    Construct objective function
    We find

    \begin{aligned} \text{total operational cost }&= c^Tx = c_1x_1 + \cdots +c_nx_n\\ \text{total transportation cost }&= \text{tr}\,(C^TX) = \sum_{i=1}^m \sum_{j=1}^n c_{ij}x_{ij}\\ \text{total cost }&= c^Tx + \text{tr}\,(C^TX). \end{aligned}

    Thus the objective function is f_0(x,X) = c^Tx + \text{tr}\,(C^TX).

    Construct constraints
    The constraint that x_j,x_{ij} are binary is of course natural for the problem.
    Since each worker is assigned to exactly one location, we have

    \sum_{j=1}^n x_{ij} = 1, \quad \text{ for each } i=1,\ldots,m.

    Lastly, observe x_j=0 \implies x_{ij}=0 since the j-th location not operating means it cannot host a worker; thus we have x_{ij} \leq x_j .

    Formulate Problem
    Putting everything together, the (BLP) formulation of the problem is

    \begin{cases} \text{minimize} & c^Tx + \text{tr}\,(C^TX)\\ \text{subject to} & x \in \{0,1\}^n\\ &X \in \{0,1\}^{m \times n}\\ &\sum_{j=1}^n x_{ij} =1, i = 1,\ldots,m\\ & x_{ij} \leq x_j, i=1,\ldots,m,j=1,\ldots,n \end{cases}.

    An (LP) relaxation (and hence convex approximation) of this (BLP) is

    \begin{cases} \text{minimize} & c^Tx + \text{tr}\,(C^TX)\\ \text{subject to} & x \in [0,1]^n\\ &X \in [0,1]^{m \times n}\\ &\sum_{j=1}^n x_{ij} =1, i = 1,\ldots,m\\ & x_{ij} \leq x_j, i=1,\ldots,m,j=1,\ldots,n \end{cases}.
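
    A minimal numerical sketch of this (LP) relaxation, assuming SciPy and a small illustrative instance (the costs are made up); the variables (x,X) are flattened into one vector for linprog:

    ```python
    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical small instance: m = 2 workers, n = 3 locations.
    m, n = 2, 3
    c = np.array([1.0, 2.0, 1.5])        # operating costs c_j
    C = np.array([[0.2, 0.9, 0.8],       # transport costs c_ij
                  [0.7, 0.3, 0.6]])

    # Decision vector z = (x_1, ..., x_n, X_11, ..., X_1n, ..., X_mn) in [0, 1].
    cost = np.concatenate([c, C.ravel()])

    # Equalities: sum_j X_ij = 1 for each worker i.
    A_eq = np.zeros((m, n + m * n))
    for i in range(m):
        A_eq[i, n + i * n : n + (i + 1) * n] = 1.0
    b_eq = np.ones(m)

    # Inequalities: X_ij - x_j <= 0.
    A_ub = np.zeros((m * n, n + m * n))
    for i in range(m):
        for j in range(n):
            r = i * n + j
            A_ub[r, n + r] = 1.0     # coefficient of X_ij
            A_ub[r, j] = -1.0        # coefficient of x_j
    b_ub = np.zeros(m * n)

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (n + m * n))
    x, X = res.x[:n], res.x[n:].reshape(m, n)
    print(np.round(x, 3), np.round(X, 3), res.fun)
    ```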






















    Quadratic Programming Quadratic program: an (OP) of the form

    \text{(QP)} \begin{cases} \text{minimize}&\frac{1}{2}x^TQx + q^Tx\\ \text{subject to} &Gx \preceq h\\ &Ax =b \end{cases}

    where

    \begin{aligned} &Q \in \boldsymbol{S}^n, \quad G \in \mathbb{R}^{m \times n}, \quad A \in \mathbb{R}^{p \times n}\\ &q \in \mathbb{R}^n, \quad h \in \mathbb{R}^m,\quad  b \in \mathbb{R}^p. \end{aligned}

    If Q \in \boldsymbol{S}_+^n , then the problem is convex (see remarks).

    Remarks.
    1. As for (LP), the constraints

      \begin{cases} Gx \preceq h\\ Ax =b \end{cases}

      describe a polyhedron.








    2. Given d \in \mathbb{R} , then

      \begin{aligned} \min\{\frac{1}{2}x^TQx + q^Tx + d :x \in F\} &= \min\{ \frac{1}{2}x^TQx + q^Tx: x \in F \} + d\\ \text{argmin}\{\frac{1}{2}x^TQx + q^Tx + d :x \in F\} &= \text{argmin}\{ \frac{1}{2}x^TQx + q^Tx: x \in F \} \\ \end{aligned}

      WLOG: can assume d=0 to solve problem.








    3. If

      f_0(x) = \frac{1}{2}x^TQx + q^Tx ,

      then

      \begin{aligned} \nabla f_0(x) &= Qx + q\\ \nabla^2 f_0(x) &= Q. \end{aligned}

      Thus Q \succeq 0 implies f_0 is convex.
      N.B.: The factor \frac{1}{2} is just a convenient normalization.








    4. The generalization

      \begin{aligned} &Gx \preceq h \, \to \, \frac{1}{2} x^T Q_ix+g_i^Tx \leq h_i \\ &Q_i \in \mathbb{R}^{n\times n}, g_i \in \mathbb{R}^n, h_i \in \mathbb{R}, \quad i =1,\ldots,m \\ \end{aligned}

      results in quadratically constrained quadratic programming (QCQP).
      Imposing Q,Q_i \in \boldsymbol{S}_+^n ensures the (QCQP) is convex.
      (C.f., Convex Function Theory.Level Sets.)








    5. The generalization

      \begin{aligned} &Ax=b \, \to \, \frac{1}{2} x^T P_ix+a_i^Tx =b_i  \\ &P_i \in \mathbb{R}^{n\times n}, a_i \in \mathbb{R}^n, b_i \in \mathbb{R}, \quad i =1,\ldots,p \\ \end{aligned}

      also results in a (QCQP), but this can break convexity.
      E.g., the quadratic constraints

      x_i(x_i-1)=0 \text{ or } x_i^2 = 1

      result in a (BLP) since these constraints enforce x_i \in \{0,1\} or x_i \in \{-1,1\}, respectively.








    6. There holds

      \begin{aligned} \text{(QCQP)} + (Q_i = P_i= 0)  &\iff \text{(QP)}\\ \text{(QP)} + (Q=0) &\iff \text{(LP)}. \end{aligned}










    7. The assumption Q \in \boldsymbol{S}^n (i.e., Q = Q^T ) is not a serious restriction.
      Indeed, first note that, since x^TQx is a scalar, we have

      x^TQx = (x^TQx)^T = x^TQ^Tx .



      Let

      f(x) = \frac{1}{2}x^TQx + q^T x .



      Thus,

      \begin{aligned} f(x) &= \frac{f(x) + f(x)}{2}\\ &= \frac{1}{2}\left( \frac{1}{2}x^T Q x + q^Tx + \frac{1}{2}x^T Q^T x + q^Tx \right)\\ &=\frac{1}{2}\left( \frac{1}{2}x^T(Q+Q^T)x + 2 q^Tx \right)\\ &=:\frac{1}{2}x^T \tilde{Q} x + q^T x, \end{aligned}

      where

      \tilde{Q} = \frac{1}{2}(Q+Q^T) .



      Lastly, observe the symmetry of \tilde{Q} :

      \tilde{Q}^T = \left( \frac{1}{2}(Q+Q^T) \right)^T = \frac{1}{2}(Q^T+Q) = \tilde{Q}.



      Thus every quadratic

      f(x) = \frac{1}{2}x^TQx + q^Tx

      has a “symmetric representation” of the form

      f(x) = \frac{1}{2}x^T\tilde{Q}x + q^Tx with \tilde{Q} \in \boldsymbol{S}^n.
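
    This symmetrization is a one-line check numerically (a minimal sketch assuming NumPy):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 3))       # generic nonsymmetric Q
    Q_tilde = 0.5 * (Q + Q.T)         # symmetric representative

    x = rng.normal(size=3)
    print(np.isclose(x @ Q @ x, x @ Q_tilde @ x))   # same quadratic form: True
    ```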
















    Example: Least Squares Least squares: an unconstrained (QP) of the form

    \text{(LS)} \begin{cases} \text{minimize}&\Vert Ax-b\Vert_2^2= x^TA^TAx-2b^TAx + b^Tb \end{cases}

    where A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m .
    Features:
    • WLOG: may assume columns of A are linearly independent, and so

      m \geq n .

    • N.B.: even without this assumption, the least norm solution to (LS) is given by

      x^\star = A^{\dagger}b ,

      where A^\dagger is the pseudo-inverse (aka Moore-Penrose inverse) of A .
    • Under the WLOG assumption, there holds

      x^\star = (A^TA)^{-1}A^Tb .



    Recall (definition of \Vert \cdot \Vert_2 ) For x \in \mathbb{R}^n , the notation \Vert x \Vert_2 means the vector norm

    \Vert x \Vert_2 := \sqrt{x_1^2 + \cdots + x_n^2 } .

    Thus

    \Vert x- y \Vert_2

    is the distance between the two vectors x,y \in \mathbb{R}^n .







    Rough Justification of WLOG Let

    \begin{aligned} A  &= \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \in \mathbb{R}^{2 \times 3}, \quad a_1,a_2,a_3 \in \mathbb{R}^2 \\ x &=  \begin{bmatrix}  x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad b =  \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}. \end{aligned}

    N.B.: a set of three 2-dimensional vectors is always linearly dependent and so

    a_3 = c_1 a_1 + c_2 a_2

    for some c_1,c_2 \in \mathbb{R} .
    Therefore, writing the entries of A as scalars (not to be confused with the vector b above), we may write

    A=: \begin{bmatrix} a & \alpha & u \\ b & \beta & v  \end{bmatrix} = \begin{bmatrix} a & \alpha & c_1 a + c_2 \alpha\\ b & \beta & c_1 b + c_2 \beta \end{bmatrix} .



    Computing

    \begin{aligned} Ax &= \begin{bmatrix} a & \alpha & c_1 a + c_2 \alpha\\ b & \beta & c_1 b + c_2 \beta \end{bmatrix} \begin{bmatrix}  x_1 \\ x_2 \\ x_3 \end{bmatrix}\\ &= \begin{bmatrix} ax_1 + \alpha x_2 + c_1 a x_3 + c_2 \alpha x_3\\ bx_1 + \beta x_2 + c_1 b x_3 + c_2 \beta x_3 \end{bmatrix}\\ &= \begin{bmatrix} a(x_1 + c_1 x_3) + \alpha (x_2 + c_2 x_3)\\ b(x_1 + c_1 x_3) + \beta(x_2 + c_2 x_3) \end{bmatrix} \end{aligned}

    we see that minimizing

    \Vert Ax-b \Vert_2^2 = \bigg\Vert \begin{bmatrix} a(x_1 + c_1 x_3) + \alpha (x_2 + c_2 x_3)\\ b(x_1 + c_1 x_3) + \beta(x_2 + c_2 x_3) \end{bmatrix} - \begin{bmatrix} b_1\\b_2 \end{bmatrix} \bigg\Vert_2^2

    is equivalent to minimizing

    \begin{aligned} \Vert \hat{A}y - b \Vert_2^2 &:= \bigg\Vert \begin{bmatrix} a&\alpha\\b&\beta \end{bmatrix} \begin{bmatrix}y_1\\y_2\end{bmatrix}-\begin{bmatrix}b_1\\b_2\end{bmatrix} \bigg\Vert_2^2\\ &= \bigg\Vert \begin{bmatrix} ay_1 + \alpha y_2\\ by_1 + \beta y_2 \end{bmatrix} -\begin{bmatrix}b_1\\b_2\end{bmatrix} \bigg\Vert_2^2 \end{aligned}.

    Equivalence follows from the change of variables

    \begin{aligned} y_1 &= x_1 + c_1 x_3\\ y_2 &= x_2 + c_2 x_3  \end{aligned} .



    Therefore, a (LS) problem with matrix of size 2 \times 3 is equivalent to a (LS) problem with matrix of size 2 \times 2 .

    This argument holds in general: if A has linearly dependent columns, can use change of variables to ensure linear independence.







    Remarks.
    1. If Ax=b has solution x^\star , then x^\star solves the (LS) problem.

    2. If Ax=b has no solution, then any x^\star solving the (LS) problem gives a “best” estimate solution to Ax = b, where “best” is chosen to mean in terms of the vector norm \Vert \cdot \Vert_2 .

    3. While minimizing \Vert \cdot \Vert_2^2 is equivalent to minimizing \Vert \cdot \Vert_2 , the exponent 2 ensures \Vert \cdot \Vert_2^2 is differentiable (at 0 ).
      (C.f.: |x| is not differentiable at x=0 , but x^2 is.)








    Solving the (LS)
    Step 1. (The problem is convex) The objective function

    f_0(x) = \Vert Ax-b\Vert_2^2 = x^TA^TAx - 2b^TAx+b^Tb

    is convex since

    \nabla^2 f_0 = 2 A^TA \succeq 0 .

    To see A^TA \succeq 0 , observe:
    1. (A^T A)^T = A^T A and so A^TA \in \boldsymbol{S}^n , i.e., A^TA is symmetric;
    2. for all z \in \mathbb{R}^n there holds

      z^T A^T A z = (Az)^T Az = \Vert Az \Vert_2^2 \geq 0 ,

      i.e., A^TA \in \boldsymbol{S}_+^n .




    Step 2. (The critical points) We compute

    \begin{aligned} \nabla f_0 &= \nabla (x^T A^TA x) - \nabla(2b^TAx) + \nabla(b^Tb)\\ &= 2 A^TAx - 2A^Tb + 0\\ &= 2(A^TAx - A^Tb). \end{aligned}

    Therefore, \nabla f_0 = 0 iff

    A^TAx = A^T b .

    This system of equations is known as the normal equations.



    Step 3. (The solution) By corollary proved above: x^\star is a solution iff \nabla f_0(x^\star) = 0 .
    Moreover, A having linearly independent columns ensures that A^TA is invertible:
    1. linearly independent columns \implies Az=0 iff z=0.
    2. A^TAz = 0 \iff Az=0 \iff z = 0 .
    3. Since A^TA is square, can conclude A^TA is invertible.


    Lastly: since

    \nabla f_0(x^\star)= 0 iff A^TA x^\star = A^T b

    and since A^TA is invertible, we conclude that the solution to the (LS) is

    x^\star = (A^TA)^{-1}A^Tb .
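
    A minimal numerical check of the normal equations against library least-squares routines, assuming NumPy and random illustrative data:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(10, 3))   # tall matrix; columns (generically) independent
    b = rng.normal(size=10)

    x_normal = np.linalg.solve(A.T @ A, A.T @ b)         # normal equations
    x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)      # library LS solver
    print(np.allclose(x_normal, x_lstsq))                # True
    print(np.allclose(x_normal, np.linalg.pinv(A) @ b))  # pseudo-inverse agrees
    ```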
















    Example: Distance Between Convex Sets Let K_1,K_2 \subset \mathbb{R}^n be convex subsets.
    Nearest Point Problem (NPP): Among x \in K_1, y \in K_2 , which pair (x,y) minimizes the (squared) distance \Vert x-y \Vert_2^2 ?



    The (NPP) may be expressed as a standard form (COP).
    Indeed, supposing

    \begin{aligned} K_1 &= \{ f_1(x) \leq 0 \}\\ K_2 &= \{ f_2(x) \leq 0 \}, \end{aligned}

    with f_1,f_2 convex, then the (NPP) for K_1,K_2 is the (COP)

    \begin{cases} \text{minimize} & \Vert x - y \Vert_2^2\\ \text{subject to} & f_1(x) \leq 0 \\ &f_2(y) \leq 0 \end{cases}.

    N.B.: the feasible set is

    \begin{aligned} F&=\{(x,y) \in \mathbb{R}^n \times \mathbb{R}^n : f_1(x) \leq 0, f_2(y) \leq 0 \}\\ &=K_1 \times K_2. \end{aligned}

    Viz.: (x,y) \in F iff x \in K_1, y \in K_2 .





    When K_1,K_2 are polyhedra, the (NPP) is a (QP).
    Indeed, supposing

    \begin{aligned} K_1 &= \{ G_1 x \preceq h_1 \}\\ K_2 &= \{ G_2 x \preceq h_2 \}, \end{aligned}

    then the (NPP) for K_1,K_2 is the (QP) given by

    \begin{cases} \text{minimize} & \Vert x - y \Vert_2^2\\ \text{subject to} & G_1 x \preceq h_1\\ & G_2 y \preceq h_2 \end{cases}.

    Can formulate as constrained least squares problem: defining

    \begin{aligned}  B,C \in \mathbb{R}^{n \times n} &\mapsto B \oplus C =  \begin{bmatrix} B & \vline & 0\\ \hline 0 & \vline & C \end{bmatrix} \in \mathbb{R}^{2n \times 2n}, \\ v,w \in \mathbb{R}^n &\mapsto  v \oplus w = \begin{bmatrix} v_1 & \cdots & v_n & w_1 & \cdots & w_n \end{bmatrix}^T \in \mathbb{R}^{n + n}\\ A &= \begin{bmatrix} Id_{n\times n} &\vline & - Id_{n\times n}\\ \hline 0 &\vline& 0 \end{bmatrix} \in \mathbb{R}^{2n\times 2n} \end{aligned}

    we readily see that the problem is equivalent to

    \begin{cases} \text{minimize} & \Vert A(x\oplus y) \Vert_2^2\\ \text{subject to} & (G_1 \oplus G_2) (x \oplus y) \preceq h_1 \oplus h_2 \end{cases}.
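
    For concreteness, here is a minimal sketch of the (NPP) between two boxes, assuming SciPy; a general-purpose constrained solver (SLSQP) stands in for a dedicated (QP) solver, and the boxes are illustrative:

    ```python
    import numpy as np
    from scipy.optimize import minimize

    # Illustrative boxes: K1 = [0,1]^2 and K2 = [2,3] x [0,1], each as {z : G z <= h}.
    G1 = np.vstack([np.eye(2), -np.eye(2)]); h1 = np.array([1.0, 1.0, 0.0, 0.0])
    G2 = np.vstack([np.eye(2), -np.eye(2)]); h2 = np.array([3.0, 1.0, -2.0, 0.0])

    def obj(z):                         # z = (x, y); objective ||x - y||_2^2
        x, y = z[:2], z[2:]
        return np.sum((x - y) ** 2)

    cons = [{"type": "ineq", "fun": lambda z: h1 - G1 @ z[:2]},   # x in K1
            {"type": "ineq", "fun": lambda z: h2 - G2 @ z[2:]}]   # y in K2
    z0 = np.array([0.5, 0.5, 2.5, 0.5])                           # feasible start
    res = minimize(obj, z0, constraints=cons)
    print(res.x, res.fun)   # nearest points ~ (1, t) and (2, t); squared distance 1
    ```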







    Polyhedral Approximation. Let K_1,K_2 be nonempty convex domains, not necessarily polyhedral. Let P_1,P_2,Q_1,Q_2 be polyhedral domains such that

    \begin{aligned} &P_1 \subset K_1 \subset Q_1\\ &P_2 \subset K_2 \subset Q_2. \end{aligned}

    Then the (NPP)s for the pairs [P_1,P_2] and [Q_1,Q_2] are (QP) approximations of the (NPP) for [K_1,K_2] and provide, respectively, upper and lower bounds for its optimal value.















    Geometric Programming Monomial function: given

    c > 0,\, a_i \in \mathbb{R},\, i =1,\ldots,n ,

    a function of the form

    \begin{aligned} f&: \mathbb{R}^n \to \mathbb{R}\\ f(x)&= c x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}\\ \text{dom}\,f&=\mathbb{R}_{++}^n. \end{aligned}

    Example.

    3x_1^2x_2^{3.4}x_3^{-\pi}.









    Posynomial: given

    c_k > 0,\, a_{ik}\in \mathbb{R},\, i=1,\ldots,n,\, k=1,\ldots,K ,

    a function of the form

    \begin{aligned} f&: \mathbb{R}^n \to \mathbb{R}\\ f(x)&= \sum_{k=1}^K c_k x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_n^{a_{nk}}\\ \text{dom}\,f&=\mathbb{R}_{++}^n. \end{aligned}

    Example.

    f(x) = \sqrt{2}x_1x_2\sqrt{x_3} + 3 x_1^4x_2^{-4}x_3^{-1.5}+x_5









    Geometric Programming: An (OP) of the form

    \text{(GP)} \begin{cases} \text{minimize}&f_0(x)\\ \text{subject to}&f_i(x) \leq 1, \, i=1,\ldots,m\\ & h_i(x) = 1, \, i=1,\ldots,p \end{cases},

    where

    \begin{aligned} f_0,f_1,\ldots,f_m & \text{ are posynomials}\\ h_1,\ldots,h_p & \text{ are monomials}. \end{aligned}

    N.B.: this is an undercover (COP).







    Remarks.
    1. Let

      \begin{aligned} Posy_n &= \text{ set of posynomials on } \mathbb{R}_{++}^n\\ Mon_n &= \text{ set of monomials on } \mathbb{R}_{++}^n. \end{aligned}

      Since any monomial is a posynomial, we have Mon_n \subset Posy_n .







    2. If

      \begin{aligned}  \lambda &\geq 0\\ p(x),q(x) &\in Posy_n\\ f(x),g(x) &\in Mon_n , \end{aligned}

      then

      \begin{aligned} \lambda p(x),\, p(x) + q(x),\, p(x) q(x),\, \frac{p(x)}{f(x)} \in Posy_n\\ \lambda f(x),\, f(x)g(x),\, \frac{f(x)}{g(x)} \in Mon_n. \end{aligned}

      Thus Mon_n is a group under multiplication, and Posy_n is the convex cone generated by Mon_n .







    3. As usual, we use the language “writing in standard form” to refer to writing an equivalent (OP) in the form (GP) above.

      General (OPs) clearly equivalent to a (GP) may be called geometric programs in nonstandard form.

      For example, the geometric program

      \begin{cases} \text{maximize}&f_0(x)\\ \text{subject to} & f_i(x) \leq g_i(x)\\ & h_i(x) = k_i(x) \end{cases}

      with

      \begin{aligned} f_i  &\text{ are posynomials}\\ f_0,g_i,h_i,k_i &\text{ are monomials} \end{aligned}

      is readily rewritten as a standard form (GP):

      \begin{cases} \text{minimize}&\frac{1}{f_0(x)}\\ \text{subject to} & \frac{f_i(x)}{g_i(x)} \leq 1\\ & \frac{h_i(x)}{k_i(x)} = 1 \end{cases}.

















    Rewriting (GP) as a (COP) General (GPs) are not convex (e.g., f_0(x) = \sqrt{x} ).
    However, any (GP) is easily recast as a (COP) via change of variable.

    Step 1. (The change of variable) We will write x \mapsto y to mean the change of variable given by

    x_i = e^{y_i}.







    Step 2. (Monomials \to convex function) Let

    \begin{matrix} c>0, \quad b = \log c,& f(x) \in Mon_n,\\  a = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}^T \in \mathbb{R}^n,\quad & f(x) = c x_1^{a_1}x_2^{a_2} \cdots x_n^{a_n}. \end{matrix}

    Under the change of variable x \mapsto y:

    \begin{aligned} f(x) &= f(x_1,\ldots,x_n)\\ &= f(e^{y_1},\ldots, e^{y_n})\\ &= c(e^{y_1})^{a_1} \cdots (e^{y_n})^{a_n}\\ &= e^{\log c}e^{a_1y_1}\cdots e^{a_ny_n}\\ &= e^{a^Ty+b} \end{aligned}

    But

    F\, convex \implies e^F\, convex

    and so e^{a^Ty+b} is convex since affine functions are convex.







    Step 3. (Posynomial \to convex function) For k=1,\ldots,K , let

    \begin{matrix} c_k>0, \quad b_k = \log c_k,& f(x) \in Posy_n,\\  a_k = \begin{bmatrix} a_{1k} & \cdots & a_{nk} \end{bmatrix}^T \in \mathbb{R}^{n},\quad & f(x) = \sum_{k=1}^K c_k x_1^{a_{1k}}\cdots x_{n}^{a_{nk}}. \end{matrix}

    By Step 2., there holds

    \begin{aligned} f(x) = f(y) = \sum_{k=1}^K e^{a_k^Ty + b_k}, \end{aligned}

    which is again a convex function.
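
    The identity f(x) = \sum_k e^{a_k^Ty+b_k} under x_i = e^{y_i} is easy to sanity-check numerically (a minimal sketch assuming NumPy, with a random illustrative posynomial):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    K, n = 4, 3
    a = rng.normal(size=(K, n))          # exponent vectors a_k (rows)
    c = rng.uniform(0.5, 2.0, size=K)    # positive coefficients c_k
    b = np.log(c)                        # b_k = log c_k

    y = rng.normal(size=n)
    x = np.exp(y)                        # the change of variables x_i = e^{y_i}

    f_x = np.sum(c * np.prod(x ** a, axis=1))   # posynomial form in x
    f_y = np.sum(np.exp(a @ y + b))             # sum-of-exponentials form in y
    print(np.isclose(f_x, f_y))                 # True
    ```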







    Step 4. ((GP) \to (COP)) We explicitly write the (GP) as:

    \begin{cases} \text{minimize}&f_0(x) = \sum_{k=1}^{K_0} c_{0k} x_1^{a_{01k}}\cdots x_n^{a_{0nk}}\\ \text{subject to}& f_i(x) = \sum_{k=1}^{K_i} c_{ik} x_1^{a_{i1k}} \cdots x_n^{a_{ink}} \leq 1, \, i=1,\ldots,m\\ & h_i(x) =d_{i}x_1^{b_{i1}}\cdots x_n^{b_{in}} = 1, \, i=1,\ldots,p \end{cases}.

    Let

    \begin{aligned} a_{ik} &= \begin{bmatrix} a_{i1k} & \cdots & a_{ink} \end{bmatrix}^T \in \mathbb{R}^n\\ b_i &= \begin{bmatrix} b_{i1} & \cdots & b_{in} \end{bmatrix}^T \in \mathbb{R}^n\\ \alpha_{ik} &= \log c_{ik}\\ \delta_i &= \log d_{i}  \end{aligned}.

    Under the change of variable x \mapsto y , this (GP) becomes the (COP)

    \begin{cases} \text{minimize}&f_0(y) = \sum_{k=1}^{K_0} e^{a_{0k}^T y + \alpha_{0k}}\\ \text{subject to}& f_i(y) = \sum_{k=1}^{K_i}  e^{a_{ik}^T y + \alpha_{ik}} \leq 1, \, i=1,\ldots,m\\ & h_i(y) = e^{b_{i}^Ty + \delta_i} = 1, \, i=1,\ldots,p \end{cases}.









    Step 5. ((GP) in convex form) At last, since exponentiation may result in unreasonably large numbers, it is customary to take logarithms, resulting in the geometric problem in convex form:

    \begin{cases} \text{minimize}& \log \left( \sum_{k=1}^{K_0} e^{a_{0k}^T y + \alpha_{0k}} \right)\\ \text{subject to}& \log \left(\sum_{k=1}^{K_i}  e^{a_{ik}^T y + \alpha_{ik}}\right) \leq 0, \, i=1,\ldots,m\\ & b_{i}^Ty + \delta_i = 0, \, i=1,\ldots,p \end{cases}.

    N.B.:
    1. The log-sum-exp function u \mapsto \log\left(\sum_k e^{u_k}\right) is convex, and convexity is preserved under composition with affine maps of y , so the problem is still convex.
    2. The constraints are equivalent since \log is monotonic and injective on \mathbb{R}_{++}.








    Example. (Taken from Boyd-Kim-Vandenberghe-Hassibi: A tutorial on geometric programming)
    Problem Maximize the volume of a box with
    • a limit on total wall area;
    • a limit on total floor area; and
    • upper and lower bounds on the aspect ratios height/width and depth/width .




    Notational set up

    \begin{aligned} \text{Optimization Variables}& \begin{cases} w &= \text{width}\\ h &= \text{height}\\ d &= \text{depth} \end{cases}\\ \text{Problem Parameters}& \begin{cases} A_{\text{wall}} &= \text{max wall area}\\ A_{\text{floor}} &= \text{max floor area}\\ \alpha_{1},\alpha_2 &= \text{lower and upper aspect ratio bounds for }h/w\\ \beta_{1},\beta_2 &= \text{lower and upper aspect ratio bounds for }d/w \end{cases} \end{aligned}





    Construct objective function
    The volume of the box is

    hwd

    and so the objective function is

    f_0(h,w,d) = hwd .

    N.B.: f_0 \in Mon_3



    Construct constraints

    \begin{aligned} \text{wall area bound }:&\quad 2hw+2hd \leq A_{\text{wall}}\\ \text{floor area bound }:&\quad wd \leq A_{\text{floor}}\\ \text{aspect ratio bounds }:&\quad \alpha_1 \leq \frac{h}{w} \leq \alpha_2\\ &\quad \beta_1 \leq \frac{d}{w} \leq \beta_2 \end{aligned}

    N.B.:

    \begin{aligned} 2hw+2hd &\in Posy_3\\ wd,\, hw^{-1},\, dw^{-1} &\in Mon_3 \end{aligned}





    Formulate Problem
    Putting everything together, we realize the problem may be formulated as the following (GP):

    \begin{cases} \text{maximize} & hwd\\ \text{subject to} & 2hw+2hd\leq A_{\text{wall}}\\ & wd \leq A_{\text{floor}}\\ & \alpha_1 \leq hw^{-1} \leq \alpha_2\\ & \beta_1 \leq dw^{-1} \leq \beta_2 \end{cases}.





    To write the problem in standard form: note the following equivalence of constraints

    \begin{aligned} 2hw+2hd\leq A_{\text{wall}} & \iff A_{\text{wall}}^{-1}2hw + A_{\text{wall}}^{-1}2hd \leq 1\\  wd \leq A_{\text{floor}} & \iff A_{\text{floor}}^{-1}wd \leq 1\\ &\\ \alpha_1 \leq hw^{-1} \leq \alpha_2 &\iff \begin{array}{l} \alpha_2^{-1}hw^{-1} \leq 1 \\  \alpha_1 h^{-1}w \leq 1 \end{array}\\ \beta_1 \leq dw^{-1} \leq \beta_2 & \iff \begin{array}{l} \beta_2^{-1} dw^{-1} \leq 1\\ \beta_1 d^{-1}w \leq 1 \end{array} \end{aligned}

    Moreover, maximizing hwd is equivalent to minimizing h^{-1}w^{-1}d^{-1} .



    Therefore, the problem in standard form is given by

    \begin{cases} \text{minimize} & h^{-1}w^{-1}d^{-1}\\ \text{subject to} & A_{\text{wall}}^{-1}2hw +A_{\text{wall}}^{-1}2hd \leq 1\\ & A_{\text{floor}}^{-1} wd \leq 1\\ &  \alpha_2^{-1} hw^{-1}\leq 1 \\ & \alpha_1 wh^{-1} \leq 1\\ & \beta_2^{-1}dw^{-1} \leq 1\\ & \beta_1 wd^{-1} \leq 1 \end{cases}.
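
    Modeling tools can handle this (GP) directly. A minimal sketch assuming cvxpy (with a conic solver supporting the exponential cone) and illustrative parameter values:

    ```python
    import cvxpy as cp

    # Illustrative parameters: A_wall = 100, A_floor = 10, aspect ratios in [0.5, 2].
    h = cp.Variable(pos=True)
    w = cp.Variable(pos=True)
    d = cp.Variable(pos=True)

    constraints = [2 * (h * w + h * d) <= 100,   # wall area bound
                   w * d <= 10,                  # floor area bound
                   0.5 <= h / w, h / w <= 2,     # aspect ratio bounds on h/w
                   0.5 <= d / w, d / w <= 2]     # aspect ratio bounds on d/w
    prob = cp.Problem(cp.Maximize(h * w * d), constraints)
    prob.solve(gp=True)                          # solve as a geometric program
    print(h.value, w.value, d.value, prob.value)
    ```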
















    Semidefinite Programming (Heavily influenced by Vandenberghe-Boyd Semidefinite Programming.)
    Linear matrix inequality (LMI): given

    \begin{aligned}  &F_0,F_1,\ldots,F_n \in \boldsymbol{S}^m\\ &x = \begin{bmatrix}x_1 & \cdots & x_n \end{bmatrix}^T \in \mathbb{R}^n\\ F(x) &:= F_0 + x_1 F_1 + \cdots + x_n F_n \end{aligned}

    an inequality of the form

    F(x) \succeq 0 .

    Recall: for A \in \boldsymbol{S}^m , we write A \succeq 0 to mean A is positive semidefinite, i.e., z^TAz \geq 0 for all z \in \mathbb{R}^m.





    Semidefinite program (SDP): an (OP) of the form

    \text{(SDP)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & F(x) \succeq 0, \end{cases}

    where

    \begin{aligned}  &F_0,F_1,\ldots,F_n \in \boldsymbol{S}^m\\ F(x) &:= F_0 + x_1 F_1 + \cdots + x_n F_n\\ c & \in \mathbb{R}^n \end{aligned}

    N.B.: The LMI F(x) \succeq 0 defines a feasible set which is convex and hence (SDPs) are convex problems.





    Convexity of Feasible Set. To see that (SDP) is a convex problem, first note: if

    t>0 and A \succeq 0,

    then

    z^T(tA)z = t ( z^TAz) \geq 0

    and so

    tA \succeq 0 .





    Next, observe: for x,y feasible and t \in [0,1] , the function

    F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n

    evaluated at the convex combination

    tx + (1-t)y

    is

    \begin{aligned}  F(tx +(1-t)y) &= F_0 + (tx_1 + (1-t)y_1) F_1 + \cdots + (tx_n + (1-t)y_n)F_n. \end{aligned}

    Expanding, rearranging and using

    F_0 = tF_0 + (1-t)F_0

    gives:

    \begin{aligned}  F(tx +(1-t)y) &= tF_0  + tx_1F_1 + \cdots + tx_nF_n \\ &+ (1-t)F_0 + (1-t)y_1F_1 + \cdots  + (1-t)y_nF_n\\ &= tF(x) + (1-t)F(y). \end{aligned}





    Using F(x),F(y)\succeq 0 , we conclude

    F(tx+(1-t)y) = tF(x) + (1-t)F(y) \succeq 0 .

    Thus, x,y feasible \implies tx + (1-t)y feasible for t \in [0,1], i.e., the feasible set is convex.







    Example 1. LPs are SDPs Consider the (LP)

    \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax + b \succeq 0  \end{cases},

    where

    A \in \mathbb{R}^{m \times n}, \quad b \in \mathbb{R}^{m}, \quad c \in \mathbb{R}^n

    and

    Ax+b \succeq 0

    means componentwise inequality.





    Given v = \begin{bmatrix}v_1 & \cdots & v_m \end{bmatrix}^T \in \mathbb{R}^m , define

    \text{diag}(v) =  \begin{bmatrix} v_1 & 0 & \cdots & 0\\ 0 & v_2 &\cdots &0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v_m \end{bmatrix}.







    Since A \in \boldsymbol{S}^m satisfies A \succeq 0 iff A has nonnegative eigenvalues, we have

    v \succeq 0 \iff \text{diag}(v) \succeq 0.

    (Indeed, the eigenvalues of \text{diag}(v) are the components of v .) Therefore,

    \begin{aligned} Ax+b \succeq 0  &\iff \text{diag}(Ax+b) \succeq 0 \\ \text{(vector inequality)} & \iff \text{(matrix inequality)}. \end{aligned}







    Letting

    A=\begin{bmatrix}a_1 & \cdots & a_n \end{bmatrix}, \quad a_i \in \mathbb{R}^m ,

    we have

    Ax + b = b + x_1 a_1 + \cdots + x_n a_n .



    Therefore, using

    \begin{aligned} \text{diag}(v+\lambda w) &=  \begin{bmatrix}  v_1 + \lambda w_1 &0 & \cdots & 0\\ 0 & v_2 + \lambda w_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & v_n + \lambda w_n  \end{bmatrix}\\ &= \begin{bmatrix}  v_1  &0 & \cdots & 0\\ 0 & v_2  & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & v_n  \end{bmatrix} + \lambda \begin{bmatrix}  w_1 &0 & \cdots & 0\\ 0 &  w_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots &  w_n  \end{bmatrix}\\ &= \text{diag}(v) + \lambda \text{diag}(w) \end{aligned}

    we have

    \begin{aligned}  \text{diag}(Ax + b) &= \text{diag}(b + x_1 a_1 + \cdots + x_n a_n)\\ &= \text{diag}(b) + x_1 \text{diag}(a_1) + \cdots + x_n \text{diag}(a_n). \end{aligned}







    Therefore, defining

    \begin{aligned}  F_0 &= \text{diag}(b), \quad F_i = \text{diag}(a_i)\\ F(x) &= F_0 + x_1 F_1 + \cdots + x_n F_n = \text{diag}(Ax+b), \end{aligned}

    we conclude

    \begin{aligned} Ax+b \succeq 0 \iff \text{diag}(Ax+b) \succeq 0 \iff F(x) \succeq 0 \end{aligned} .







    In conclusion, we have that the (LP)

    \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax + b \succeq 0  \end{cases},

    is equivalent to the (SDP)

    \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & F(x)= \text{diag}(Ax+b) \succeq 0  \end{cases}.
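
    A quick numerical sanity check that the vector and matrix inequalities agree, assuming NumPy and random illustrative data:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 2)); b = rng.normal(size=4)

    for _ in range(5):
        x = rng.normal(size=2)
        v = A @ x + b
        vec_ok = bool(np.all(v >= 0))                             # Ax + b >= 0
        psd_ok = bool(np.all(np.linalg.eigvalsh(np.diag(v)) >= 0))  # F(x) >= 0
        print(vec_ok == psd_ok)   # the two conditions agree
    ```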









    Example 2. Nonlinear OP as a SDP Consider the nonlinear (COP)

    \text{(OP1)} \begin{cases} \text{minimize}& \frac{(c^Tx)^2}{d^Tx}\\ \text{subject to} & Ax + b  \succeq 0\\ & d^Tx > 0 \end{cases}

    where

    \begin{aligned} &c,d \in \mathbb{R}^n\\ &A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m. \end{aligned}

    We will recast (OP1) as a (SDP).
    N.B.:
    1. The constraint d^Tx > 0 fixes the domain of the objective (an open halfspace) and ensures the objective is convex there.
    2. On d^Tx < 0 , the objective is instead concave.




    To begin, recall we may recast the problem in epigraph form:

    \text{(OP1)} \begin{cases} \text{minimize}& t\\ \text{subject to} & Ax + b  \succeq 0\\ & d^Tx > 0\\ & \frac{(c^Tx)^2}{d^Tx} \leq t \end{cases}.

    N.B.: this introduces the new optimization variable t .



    Goal: find a symmetric matrix-valued function

    F(x,t) = F_0 + x_1F_1 + \cdots + x_nF_n + t F_{n+1}

    such that the constraints

    \begin{cases} &Ax + b  \succeq 0\\ & d^Tx > 0\\ & \frac{(c^Tx)^2}{d^Tx} \leq t \end{cases}

    may be recast as the LMI

    \begin{cases} F(x,t) \succeq 0 \end{cases}.





    Idea: we know

    Ax + b  \succeq 0 \iff \text{diag}(Ax+b) \succeq 0 .

    On the other hand:

    \begin{aligned} \frac{(c^Tx)^2}{d^Tx} \leq t &\iff (c^Tx)^2 \leq td^Tx \\ &\iff td^Tx - (c^Tx)^2 \geq0  \end{aligned} .

    Recall: if \gamma>0 , then

    \begin{bmatrix}\alpha&\beta\\\beta&\gamma\end{bmatrix}\succeq0 \iff \alpha\geq0, \alpha\gamma - \beta^2 \geq 0

    Therefore, given d^Tx>0, we have

    \begin{aligned} \frac{(c^Tx)^2}{d^Tx} \leq t & \iff \begin{bmatrix}t & c^Tx\\ c^Tx & d^Tx \end{bmatrix} \succeq 0 \end{aligned} .





    Using that

    \begin{bmatrix} A &\vline &0\\ \hline 0 & \vline& B \end{bmatrix} \succeq 0 \iff A \succeq 0 , B \succeq 0,

    we therefore introduce

    E =  \begin{bmatrix} \text{diag}(Ax+b) & 0 & 0\\ 0 & t  & c^Tx\\ 0 & c^Tx & d^Tx \end{bmatrix}

    to capture the problem’s constraints.
    Indeed, evidently, E \succeq 0 iff

    \text{diag}(Ax+b) \succeq 0 and \begin{bmatrix} t & c^Tx\\c^Tx & d^Tx \end{bmatrix} \succeq 0 .





    Therefore,

    \begin{array}{l}  Ax + b  \succeq 0\\  d^Tx > 0\\  \frac{(c^Tx)^2}{d^Tx} \leq t \end{array} \iff \begin{array}{l} \begin{bmatrix} \text{diag}(Ax+b) & 0 & 0\\ 0 & t  & c^Tx\\ 0 & c^Tx & d^Tx \end{bmatrix} \succeq 0 \end{array} .

    This is enough to conclude (OP1) may be recast as an (SDP).





    To make it clearer, introduce the notation

    A = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}, \quad a_i \in \mathbb{R}^m

    and (m+2)\times(m+2) matrices

    \begin{aligned} F_0 &= \text{diag}(b,0,0)  \\ F_i & =  \begin{bmatrix} \text{diag}(a_i) &0 &0\\ 0&0&c_i\\ 0&c_i&d_i \end{bmatrix}\\ F_{n+1} & = \begin{bmatrix}0_{m \times m} &0 &0\\0 & 1 & 0\\0 & 0 & 0\end{bmatrix}\\ \end{aligned}.

    Then

    \begin{aligned} \begin{bmatrix} \text{diag}(Ax+b) & 0 & 0\\ 0 & t  & c^Tx\\ 0 & c^Tx & d^Tx \end{bmatrix} = F_0 + x_1F_1 + \cdots + x_nF_n + t F_{n+1}:=F(x,t) \end{aligned}

    and so (OP1) is equivalent to the (SDP)

    \begin{cases} \text{minimize}& t\\ \text{subject to} & F(x,t) \succeq 0. \end{cases}
















    Lagrangian Duality

    Throughout, let (OP) denote a given optimization problem of the form

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0 , \quad i=1,\ldots,m\\ & h_i(x) = 0, \quad i = 1,\ldots,p \end{cases}.

    Recall:

    \begin{aligned} \text{Problem domain: }& D := \bigcap_{i=0}^m \text{dom}\, f_i \cap \bigcap_{i=1}^p \text{dom}\,h_i\\ \text{Optimal value: } & p^\star := \inf\{f_0(x): x \in D , x \text{ feasible}\}. \end{aligned}

    The Lagrange Dual Lagrangian: the function

    \begin{aligned}  L&:\mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R} \\ \text{dom}\,L &= D \times \mathbb{R}^m \times \mathbb{R}^p \end{aligned}

    given by

    \begin{aligned} L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \end{aligned} .

    N.B.: for each fixed x \in D , the function

    (\lambda,\nu) \mapsto L(x,\lambda,\nu)

    is affine.





    Lagrange multipliers: the variables \lambda_i and \nu_i .
    The vectors

    \begin{aligned} \lambda :=  \begin{bmatrix}\lambda_1\\\lambda_2\\\vdots\\\lambda_m\end{bmatrix}, \quad  \nu := \begin{bmatrix}\nu_1\\\nu_2\\\vdots\\\nu_p\end{bmatrix} \end{aligned}

    are called dual variables.





    Lagrange dual function: the function

    \begin{aligned} g&:\mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}\\ \end{aligned}

    given by

    g(\lambda,\nu) = \inf\{L(x,\lambda,\nu): x \in D\} .

    Those (\lambda,\nu) \in \mathbb{R}^m_+ \times \mathbb{R}^p satisfying

    g(\lambda,\nu) > -\infty

    are called dual feasible.
    N.B.: as an infimum of affine functions, g is automatically concave.





    Proposition. For

    \lambda \in \mathbb{R}^m_+, \quad \nu \in \mathbb{R}^p ,

    there holds

    g(\lambda,\nu) \leq p^\star,

    where p^\star is the optimal value for the given (OP).
    Proof.
    1. Let x \in D be feasible.
      Then

      \begin{aligned} f_i(x) &\leq 0, \quad i=1,\ldots,m\\ h_i(x) &=0, \quad i=1,\ldots,p. \end{aligned}







    2. Let \lambda \succeq 0 and \nu be arbitrary.
      Then feasibility of x implies

      \begin{aligned} \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) = \sum_{i=1}^m \lambda_i f_i(x) \leq 0. \end{aligned}

      Consequently,

      \begin{aligned}  L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \leq f_0(x). \end{aligned}







    3. Therefore, for all feasible x , for \lambda \succeq 0 and for arbitrary \nu , there holds

      \begin{aligned} g(\lambda,\nu) = \inf\{L(z,\lambda,\nu):z \in D \} \leq L(x,\lambda,\nu) \leq f_0(x) \end{aligned}

      and so

      g(\lambda,\nu) \leq p^\star .

      (Indeed, g(\lambda,\nu) is a lower bound of f_0(x) and p^\star is the greatest lower bound of f_0(x) .)






    Lagrangian as underestimator.
    (See CO 5.1.4)
    Define the indicator functions

    I_-(t)= \begin{cases} 0 &t \leq 0\\ +\infty & t> 0 \end{cases},\qquad I_0(t)= \begin{cases} 0 & t=0\\ +\infty & t\neq0 \end{cases}.

    Then

    \begin{aligned} I_-(f_i(x)) &\text{ is } 0 \text{ when the constraint } f_i(x) \leq 0 \text{ holds, and } +\infty \text{ when it is violated}\\ I_0(h_i(x)) &\text{ is } 0 \text{ when the constraint } h_i(x) = 0 \text{ holds, and } +\infty \text{ when it is violated} \end{aligned}

    and the (OP) is equivalent to

    \begin{aligned} \begin{cases} \text{ minimize } f_0(x) + \sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x)). \end{cases} \end{aligned}

    N.B.: the terms in

    \begin{aligned}\sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x)) \end{aligned}

    act as penalties for breaking the desired constraints.





    N.B.: if x \in D and (\lambda,\nu) \in \mathbb{R}_+^m\times\mathbb{R}^p , then

    \begin{aligned} \lambda_i f_i(x) &\leq I_-(f_i(x))\\ \nu_i h_i(x) &\leq I_0(h_i(x)) \end{aligned}

    and hence

    \begin{aligned}L(x,\lambda,\nu) & = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \\ &\leq f_0(x) + \sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x))\end{aligned} .

    Viz., L is an underestimator of the objective function

    f_0(x) + \sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x))

    obtained by “softening” or “weakening” the penalty functions:

    \begin{aligned} I_-(t) & \to ct, \quad c>0\\ I_0(t) & \to bt, \quad b \in \mathbb{R}. \end{aligned}







    In particular, for each (\lambda,\nu) \in \mathbb{R}_+^m \times \mathbb{R}^p , the problem

    \begin{cases} \text{minimize } L(x,\lambda,\nu) \end{cases}

    has optimal value g(\lambda,\nu) and provides an underestimation of the original (OP).















    Example 1 Consider the least squares problem

    (LS) \begin{cases} \text{minimize} & x^Tx\\ \text{subject to}& Ax=b \end{cases}

    for given

    A = \begin{bmatrix} a_1^T \\ \vdots \\ a_p^T \end{bmatrix} \in \mathbb{R}^{p\times n}, \quad a_i \in \mathbb{R}^n, \quad b \in \mathbb{R}^p .

    N.B.:

    \begin{aligned} Ax = b \iff h_i(x) = a_i^Tx - b_i = 0 \text{ for } i=1,\ldots,p \end{aligned}.







    Therefore, the Lagrangian L for (LS) is

    \begin{aligned} L&:\mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\,L &= \mathbb{R}^n \times \mathbb{R}^p\\ L(x,\nu) &= x^Tx + \sum_{i=1}^p \nu_i (a_i^Tx - b_i) \\ &= x^Tx + \nu^T(Ax-b) \end{aligned}

    and the Lagrange dual is

    \begin{aligned} g&:\mathbb{R}^p \to \mathbb{R}\\ g(\nu) &= \inf\{ x^Tx + \nu^T(Ax-b) : x \in \mathbb{R}^n \}. \end{aligned}







    N.B.:

    \begin{aligned} \nabla^2_x L(x,\nu) &= \nabla_x^2 ( x^Tx + \nu^T(Ax-b)) \\ &= 2Id_{n \times n}\\ & \succeq 0  \end{aligned}

    and so x\mapsto L(x,\nu) is convex.
    Consequently,

    \begin{aligned} L(x^\star,\nu) = \inf\{L(x,\nu):x \in \mathbb{R}^n \} = \text{min}\{L(x,\nu):x \in \mathbb{R}^n \} \end{aligned}

    iff

    \nabla_x L(x^\star,\nu) = 2x^\star + A^T\nu = 0 ,

    i.e., iff

    x^\star = -\frac{1}{2}A^T\nu.







    In conclusion,

    \begin{aligned} g(\nu) &= L(x^\star,\nu)\\ &= (x^\star)^Tx^\star + \nu^T(Ax^\star-b)\\ &= \left(-\frac{1}{2}A^T\nu\right)^T\left(-\frac{1}{2}A^T\nu\right) +\nu^T\left(A\left(-\frac{1}{2}A^T\nu\right) - b\right)\\ &=\frac{1}{4}\nu^TAA^T\nu - \frac{1}{2}\nu^TAA^T\nu - \nu^Tb\\ &=-\frac{1}{4}\nu^TAA^T\nu - \nu^Tb. \end{aligned}

    In particular,

    -\frac{1}{4}\nu^TAA^T\nu - b^T\nu \leq \inf\{x^Tx:Ax=b\}

    for all \nu \in \mathbb{R}^p.
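
    Weak duality here is easy to test numerically (a minimal sketch assuming NumPy; A , b and the sampled \nu are random and illustrative):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(2, 4))     # wide A: Ax = b is (generically) consistent
    b = rng.normal(size=2)

    # Primal optimal value: minimum-norm solution of Ax = b via the pseudo-inverse.
    x_star = np.linalg.pinv(A) @ b
    p_star = x_star @ x_star

    # g(nu) = -(1/4) nu^T A A^T nu - b^T nu should lower-bound p_star for every nu.
    for _ in range(3):
        nu = rng.normal(size=2)
        g = -0.25 * nu @ (A @ A.T) @ nu - b @ nu
        print(g <= p_star + 1e-12)   # True each time
    ```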














    Example 2 Consider the linear program

    \text{(LP)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to}& Ax=b\\ & x\succeq 0 \end{cases}

    for given

    c \in \mathbb{R}^n, \quad A \in \mathbb{R}^{p \times n}, \quad b \in \mathbb{R}^p .

    N.B.:
    • equality constraints given by

      h_i(x) = a_i^Tx-b_i =0, \quad i=1,\ldots,p.

    • x\succeq 0 iff

      x_i \geq 0, \quad i=1,\ldots,n\quad

      iff

      \quad f_i(x) = -x_i \leq 0,\quad i=1,\ldots,n.







    Therefore, the Lagrangian for (LP) is

    \begin{aligned} L&: \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\, L &=\mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \\ L(x,\lambda,\nu) &= c^Tx - \sum_{i=1}^n \lambda_i x_i + \sum_{i=1}^p \nu_i(a_i^Tx-b_i)\\ &= c^Tx - \lambda^Tx + \nu^T(Ax-b)\\ &=(c - \lambda + A^T\nu)^Tx - \nu^Tb \end{aligned}







    Want to compute

    g(\lambda,\nu) = \inf\{ L(x,\lambda,\nu) : x \in \mathbb{R}^n \},

    but

    x \mapsto (c - \lambda + A^T\nu)^Tx - \nu^Tb

    is an affine function with domain \mathbb{R}^n .
    Therefore,

    x \mapsto (c - \lambda + A^T\nu)^Tx - \nu^Tb

    is bounded below iff (\lambda,\nu) satisfy

    c - \lambda + A^T\nu = 0 .







    Therefore, the Lagrange dual is

    \begin{aligned} g&:\mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\, g &= \{(\lambda,\nu) \in \mathbb{R}^n \times \mathbb{R}^p : c - \lambda + A^T\nu= 0\}\\ g(\lambda,\nu) &= \begin{cases} -b^T\nu & c - \lambda + A^T\nu = 0\\ -\infty & \text{else} \end{cases}. \end{aligned}



    In particular, for dual feasible (\lambda,\nu), there holds

    -b^T\nu \leq c^Tx

    for all feasible x .














    Return of Conjugate Function Recall: given f:\mathbb{R}^n \to \mathbb{R}, its conjugate function f^* is the convex function

    f^*(y) = \sup\{y^Tx - f(x) : x \in \text{dom}\,f\}.

    Interestingly: the conjugate function is related to the Lagrange dual.



    Example.
    Consider the (OP)

    \text{(OP)} \begin{cases} \text{minimize}&f_0(x)\\ \text{subject to}&Ax \preceq b\\ &Cx=d \end{cases},

    for given

    \begin{aligned} f_0:\mathbb{R}^n \to \mathbb{R},\quad A \in \mathbb{R}^{m \times n}, \quad b \in \mathbb{R}^m\\ C \in \mathbb{R}^{p \times n}, \quad d \in \mathbb{R}^p. \end{aligned}

    We may write

    \begin{aligned} Ax\preceq b &\iff f_i(x) = a_i^Tx - b_i \leq0 , \quad i =1,\ldots, m\\ Cx = d &\iff h_i(x) =c_i^Tx - d_i=0, \quad i=1,\ldots,p \end{aligned}

    for suitable

    a_i ,c_i \in \mathbb{R}^n.







    The Lagrangian is

    \begin{aligned} L&:\mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\, L &= D \times \mathbb{R}^m \times \mathbb{R}^p\\ L(x,\lambda,\nu) &= f_0(x) + \sum_{i=1}^m \lambda_i(a_i^Tx - b_i) + \sum_{i=1}^p \nu_i(c_i^Tx - d_i)\\ &= f_0(x) + \lambda^T(Ax-b) + \nu^T(Cx-d)\\ &= f_0(x) + \lambda^TAx + \nu^TCx - \lambda^Tb - \nu^Td \end{aligned}







    We may now compute the Lagrange dual in terms of the conjugate f_0^* :

    \begin{aligned} g&:\mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}\\ g(\lambda,\nu)&= \inf\{L(x,\lambda,\nu):x \in D\}\\ &=\inf\{f_0(x) + \lambda^TAx + \nu^TCx - \lambda^Tb - \nu^Td : x \in D\}\\ &=\inf\{ (A^T\lambda + C^T\nu)^Tx + f_0(x):x \in D\} - \lambda^Tb - \nu^Td \\ &=-\sup\{-(A^T\lambda+C^T\nu)^Tx - f_0(x) : x \in D\} - \lambda^Tb - \nu^Td\\ &=-f_0^*(-A^T\lambda-C^T\nu) - \lambda^Tb - \nu^Td. \end{aligned}

    Since

    \begin{aligned} g(\lambda,\nu)>-\infty \iff f_0^*(-A^T\lambda-C^T\nu)< +\infty \end{aligned},

    we conclude

    \begin{aligned}\text{dom}\,g = \{(\lambda,\nu)\in \mathbb{R}^m\times\mathbb{R}^p: -A^T\lambda-C^T\nu \in \text{dom}\,f_0^* \} \end{aligned}.
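    A quick sanity check: if f_0(x) = x^Tx , then f_0^*(y) = \frac{1}{4}y^Ty with \text{dom}\,f_0^* = \mathbb{R}^n , and the formula gives

    g(\lambda,\nu) = -\frac{1}{4}\|A^T\lambda + C^T\nu\|^2 - \lambda^Tb - \nu^Td ,

    recovering the dual of Example 1 when there are no inequality constraints (take m=0 , C=A , d=b ).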
















    Example: A Volume Minimizing Ellipsoid Problem. Given points a_1,\ldots,a_m \in \mathbb{R}^n , among all (closed) origin-centered ellipsoids \mathcal{E} satisfying a_i \in \mathcal{E} , find those with minimal volume.

    Plan. We will formulate this problem as a convex optimization problem and determine the Lagrange dual function.





    Positive semidefinite representation of ellipsoids.
    Given x' \in \mathbb{R}^n (the center; x'=0 for our problem) and X \in \boldsymbol{S}_{++}^n , the set

    \mathcal{E}_X := \{ x \in \mathbb{R}^n : (x-x')^T X (x-x') \leq 1 \}

    is an ellipsoid.
    Moreover, the volume of \mathcal{E}_X is proportional to (\det X^{-1})^{1/2}.
    (This follows from the change of variables formula.)
    Justification. WLOG (after an orthogonal change of coordinates): X = \text{diag}(v) for some v \in \mathbb{R}^n whose entries are the positive eigenvalues of X .
    Then (x-x')^TX(x-x') \leq 1 reads

    \begin{aligned}  (x-x')^TX(x-x') &= \begin{bmatrix} x_1-x'_1 & \cdots & x_n - x_n' \end{bmatrix} \begin{bmatrix}v_1 & 0 & \cdots & 0\\ 0 & v_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\0&0&\cdots&v_n \end{bmatrix}\begin{bmatrix} x_1-x'_1 \\ \vdots \\ x_n - x_n' \end{bmatrix}\\ &=v_1(x_1-x_1')^2 + \cdots + v_n(x_n-x'_n)^2\\ &\leq 1 \end{aligned}

    which is the usual description of a closed ellipsoid with center x' .
    E.g., X = r^{-2}\text{diag}(1,1,\cdots,1) gives the ball of radius r .






    Problem reformulation. Find those X \in \boldsymbol{S}_{++}^n satisfying a_i^T X a_i \leq 1 which minimize (\det X^{-1})^{1/2}.
    In fact, f_0(X) = \log \det X^{-1} is convex, and so we may reformulate the problem as the following (COP):

    \begin{cases} \text{minimize} & f_0(X) = \log \det X^{-1}\\ &\text{dom}\, f_0 = \boldsymbol{S}_{++}^n\\ \text{subject to} & a_i^T X a_i \leq 1, \quad i=1,\ldots,m. \end{cases}







    Recall
    1. \text{trace}(ABC) = \text{trace}(CAB) for A,B,C \in \mathbb{R}^{n \times n};
    2. there is a natural way of identifying a matrix A \in \mathbb{R}^{n \times n} with a vector v_A \in \mathbb{R}^{n^2} ;
    3. Under this identification, we have

      \text{trace}(A^T B) = v_A^T v_B .







    Let A_i = a_ia_i^T , noting A_i^T = A_i .
    Then 1. gives

    \begin{aligned}  a_i^TXa_i &= \text{trace}(a_i^TXa_i)\\ &= \text{trace}(a_ia_i^TX)\\ &= \text{trace}(A_iX). \end{aligned}

    Therefore 2. and 3. allow us to realize the quadratic inequality

    a_i^TXa_i \leq 1

    as the linear inequality

    \text{trace}(A_i X) = v_{A_i}^T v_X \leq 1 .

    These observations allow us to appeal to the Lagrange dual formalism.





    Let

    \mathcal{A} = \begin{bmatrix} v_{A_1}^T\\\vdots\\ v_{A_m}^T \end{bmatrix} \in \mathbb{R}^{m \times n^2}, \quad \boldsymbol{1}_m = \begin{bmatrix} 1\\\vdots\\1\end{bmatrix} \in \mathbb{R}^m .

    Therefore, the problem is equivalent to

    \begin{cases} \text{minimize} & f_0(X) = \log \det X^{-1}\\ &\text{dom}\, f_0 = \boldsymbol{S}_{++}^n\\ \text{subject to} & \mathcal{A} v_X \preceq \boldsymbol{1}_m. \end{cases}

    Introducing the Lagrange multiplier \lambda , observe

    \mathcal{A}^T\lambda = \lambda_1 v_{A_1} + \cdots + \lambda_m v_{A_m} .

    Under our chosen identification \mathbb{R}^{n \times n} \cong \mathbb{R}^{n^2} , we identify

    \lambda_i v_{A_i} \iff \lambda_i A_i = \lambda_i a_ia_i^T

    and so

    \mathcal{A}^T\lambda \iff \sum_{i=1}^m \lambda_i a_ia_i^T.

    We lastly record the conjugate function of f_0(X) = \log \det X^{-1} :

    \begin{aligned} f_0^*(Y) &= \log\det(-Y)^{-1} - n\\ \text{dom}\,f_0^* &= -\boldsymbol{S}_{++}^n \end{aligned} .

    By previous section, the Lagrange dual is given by

    \begin{aligned} g(\lambda) &= - f_0^*(-\mathcal{A}^T\lambda) - \lambda^T\boldsymbol{1}_m\\  &=  \begin{cases} \log\det\left(\sum_{i=1}^m \lambda_i a_i a_i^T \right) - \boldsymbol{1}_m^T\lambda + n & \sum_{i=1}^m \lambda_i a_i a_i^T \succ0 \\ -\infty &\text{else} \end{cases} \end{aligned} .

    Since g(\lambda) provides a lower bound on the optimal value p^\star = \log\det (X^\star)^{-1} , we conclude: if V_0 is the optimal volume and c_n>0 denotes the volume of the unit ball (so V_0 = c_n (\det (X^\star)^{-1})^{1/2} ), then

    2\log(V_0/c_n) \geq \log\det\left(\sum_{i=1}^m \lambda_i a_i a_i^T \right) - \boldsymbol{1}_m^T\lambda + n ,

    which is a very explicit lower bound depending only on the Lagrange multiplier and the problem data.
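    A minimal numerical sketch (assuming numpy) of evaluating this bound: any \lambda \succeq 0 with \sum_i \lambda_i a_ia_i^T \succ 0 yields a valid lower bound.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 2, 8
    a = rng.standard_normal((m, n))            # the points a_1, ..., a_m
    lam = np.abs(rng.standard_normal(m))       # any lambda >= 0

    M = sum(l * np.outer(ai, ai) for l, ai in zip(lam, a))
    sign, logdet = np.linalg.slogdet(M)
    if sign > 0:                               # M is PSD, so det > 0 iff M is PD
        print(logdet - lam.sum() + n)          # lower bound on p* = log det (X*)^{-1}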
    The Dual Problem Let

    \begin{aligned} g(\lambda,\nu) &=\text{ Lagrange dual function of a given (OP)}\\ p^\star &= \text{ optimal value of the (OP)}. \end{aligned}

    Recall

    \lambda \succeq 0 \implies g(\lambda,\nu) \leq p^\star



    Main point:

    \sup\{ g(\lambda,\nu):\lambda \succeq0,\nu \in \mathbb{R}^p\} \leq p^\star ,

    suggests considering the maximization problem with objective g(\lambda,\nu) .
    Gives the best lower bound obtainable from the Lagrange dual function.





    Lagrange dual problem: the problem

    \begin{cases} \text{maximize} & g(\lambda,\nu)\\ \text{subject to} & \lambda \succeq 0 \end{cases}.







    Remarks
    1. The original problem is called the primal problem.
      Viz., the dual problem is dual to the primal problem.

    2. (\lambda,\nu) feasible to dual problem \implies g(\lambda,\nu)>-\infty.
      Viz., (\lambda,\nu) is dual feasible.

    3. As stated, the only constraint is \lambda \succeq 0 ; however, domain of g usually has implicit constraints.

    4. Generally: \text{dom}\,g has “dimension” less than or equal to m+p .

    5. Recall: g is the pointwise infimum of a family of affine functions of (\lambda,\nu) .
      Thus, g is concave and -g is convex.
      So,

      maximizing concave g \iff minimizing convex -g .


      Therefore, since \lambda \succeq0 is convex constraint:

      Dual problems are always convex, even if primal is not.



    6. Solutions (\lambda^\star,\nu^\star) to dual are called dual optimal.
















    Remark on Duality for Equivalent Problems Question: if two primal problems are equivalent, how are their respective duals related?





    Spoiler: The respective dual problems may be quite different; this is demonstrated by example.





    Example. Consider the unconstrained problem

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(Ax+b). \end{cases}

    This problem is equivalent to the constrained problem

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(y)\\ \text{subject to} & y = Ax+b \end{cases}.







    Having no constraints, the Lagrangian for (OP1) is

    L(x) = f_0(Ax+b)

    and so the Lagrange dual function is simply

    g = \inf\{f_0(Ax+b) : x \in \mathbb{R}^n\}= p^\star .

    Therefore, the dual problem of (OP1) trivializes to minimizing a constant.





    Having only the equality constraints y = Ax+b , the Lagrangian for (OP2) is

    L(x,y,\nu) = f_0(y) + \nu^T(Ax+b-y) .

    Observe that L(x,y,\nu) is unbounded below in x whenever A^T\nu \neq 0 , since the term \nu^TAx = (A^T\nu)^Tx is linear in x .
    Moreover, if A^T\nu=0, then

    \begin{aligned}  g(\nu) &= \inf\{ f_0(y) + \nu^T b - \nu^T y \}\\ &=\nu^T b - \sup \{ \nu^Ty - f_0(y)\}\\ &= \nu^T b - f_0^*(\nu). \end{aligned}

    Thus,

    g(\nu) = \begin{cases} \nu^Tb - f_0^*(\nu) & A^T\nu=0\\ -\infty & \text{else}. \end{cases}

    Therefore, the dual problem to (OP2) is

    \begin{cases} \text{maximize} &b^T\nu - f_0^*(\nu)\\ \text{subject to }& A^T\nu = 0. \end{cases}
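    For instance, if f_0(y) = \frac{1}{2}y^Ty , then f_0^*(\nu) = \frac{1}{2}\nu^T\nu and the dual of (OP2) becomes

    \begin{cases} \text{maximize} &b^T\nu - \frac{1}{2}\nu^T\nu\\ \text{subject to }& A^T\nu = 0, \end{cases}

    a concave quadratic maximization over the nullspace of A^T .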





    Conclusion: the dual of (OP2) is conceivably useful, whereas the dual of (OP1) is useless, even though (OP1) and (OP2) are equivalent.














    Example: Duality of standard form and inequality form LP Recall: a standard form LP is of the form

    \text{(LP1)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax = b\\ & x \succeq 0 \end{cases} .

    An inequality form LP is of the form

    \text{(LP2)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax \preceq b \end{cases} .

    We will show the dual of (LP1) is (equivalent to a problem) of the form (LP2), and vice versa.





    The dual of (LP1)
    We consider first:

    \text{(LP1)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax = b\\ & x \succeq 0 \end{cases} .

    For (LP1), the Lagrangian is

    L(x,\lambda,\nu) = (c + A^T\nu - \lambda)^Tx - b^T\nu

    and the Lagrange dual function is

    \begin{aligned} g(\lambda,\nu) &= \begin{cases} -b^T\nu & c - \lambda + A^T\nu = 0\\ -\infty & \text{else} \end{cases}\\ \text{dom}\, g &= \{(\lambda,\nu) \in  \mathbb{R}^n \times \mathbb{R}^p : c - \lambda + A^T\nu = 0\} \end{aligned} .

    (Recall: domain is determined by where L is bounded below.)





    Therefore, the dual problem of (LP1) is

    \begin{cases} \text{maximize}&g(\lambda,\nu)= \begin{cases} -b^T\nu & c - \lambda + A^T\nu = 0\\ -\infty & \text{else} \end{cases}\\ \text{subject to}&\lambda \succeq 0 \end{cases},

    which is evidently equivalent to

    \begin{cases} \text{maximize}& -b^T\nu \\ \text{subject to}&\lambda \succeq 0\\ & c - \lambda + A^T\nu = 0 \end{cases}.

    N.B.: the domain of g had the implicit constraint c - \lambda + A^T\nu = 0 .





    Observe the equivalency of constraints:

    \begin{aligned} \begin{cases} &\lambda \succeq 0\\ & c - \lambda + A^T\nu = 0 \end{cases} & \iff \begin{cases} &\lambda \succeq 0\\ & c+ A^T\nu = \lambda \end{cases}\\ & \iff  \begin{cases} & c+ A^T\nu \succeq 0 \end{cases}\\ & \iff  \begin{cases} & c \succeq -A^T\nu \end{cases} \end{aligned}







    Therefore

    \begin{cases} \text{maximize}& -b^T\nu \\ \text{subject to}&\lambda \succeq 0\\ & c - \lambda + A^T\nu = 0 \end{cases} \iff \begin{cases} \text{minimize}& b^T\nu \\ \text{subject to}&  - A^T\nu \preceq c \end{cases}

    The last problem is of the form (LP2).
    Viz., the dual of a standard form LP is (equivalent to) an inequality form LP.
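    This equivalence can be checked numerically; a minimal sketch (assuming scipy), where b and c are constructed so that both (LP1) and its dual are feasible, hence both are solvable with equal optimal values:

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(2)
    p, n = 3, 6
    A = rng.standard_normal((p, n))
    b = A @ np.abs(rng.standard_normal(n))     # primal feasible by construction
    c = np.abs(rng.standard_normal(n)) - A.T @ rng.standard_normal(p)
    # dual feasible by construction: some nu satisfies -A^T nu <= c

    # (LP1): minimize c^T x  subject to  Ax = b, x >= 0
    primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * n)

    # its dual, in inequality form: minimize b^T nu  subject to  -A^T nu <= c (nu free)
    dual = linprog(b, A_ub=-A.T, b_ub=c, bounds=[(None, None)] * p)

    print(primal.fun, -dual.fun)               # p* and d* = -min b^T nu agree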







    The dual of (LP2)
    We now consider

    \text{(LP2)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax \preceq b \end{cases} .

    For (LP2), the Lagrangian is

    \begin{aligned} L(x,\lambda) &= c^Tx + \lambda^T(Ax-b) \\ &= (A^T\lambda + c)^Tx - b^T\lambda  \end{aligned} .

    N.B.: an affine function \alpha^Tx + \beta is bounded below iff \alpha = 0 .
    Therefore, the Lagrange dual function is

    \begin{aligned} g(\lambda) &= \begin{cases} -b^T\lambda & A^T\lambda + c = 0\\ -\infty & \text{else} \end{cases}\\ \text{dom}\,g &= \{\lambda \in \mathbb{R}^m : A^T\lambda + c = 0 \} \end{aligned} .





    Therefore, the dual problem of (LP2) is

    \begin{cases} \text{maximize}&g(\lambda)= \begin{cases} -b^T\lambda & A^T\lambda + c = 0\\ -\infty & \text{else} \end{cases}\\ \text{subject to}&\lambda \succeq 0 \end{cases},

    which is evidently equivalent to

    \begin{cases} \text{maximize}& -b^T\lambda \\ \text{subject to}& A^T\lambda + c = 0\\ & \lambda \succeq 0 \end{cases}.

    Again, the domain of g had an implicit constraint, namely, A^T\lambda + c = 0 .





    Observe the equivalency of constraints:

    \begin{cases} &A^T\lambda + c = 0\\ & \lambda \succeq 0 \end{cases} \iff \begin{cases} &A^T\lambda = -c\\ & \lambda \succeq 0 \end{cases}





    Therefore, the dual problem is equivalent to

    \begin{cases} \text{minimize}& b^T\lambda \\ \text{subject to}& A^T\lambda =-c\\ &\lambda \succeq 0 \end{cases},

    which is a problem of the form (LP1).
    Viz., the dual of an inequality form LP is a standard form LP.














    Weak and Strong Duality Let

    \begin{aligned} p^\star &= \text{ optimal value of primal problem}\\ d^\star &= \text{ optimal value of dual problem} \end{aligned}.

    Weak duality: the property d^\star \leq p^\star .
    N.B.: Optimization problems of the form (OP) always satisfy weak duality.
    Strong duality: the property d^\star = p^\star .
    N.B.: Having strong duality does not mean the primal and the dual are actually solvable.
    Constraint qualifications: conditions for a given type of problem which ensure strong duality.
    E.g., “A (QCQP) with single quadratic constraint has strong duality if _______.”
    Optimal duality gap: the difference p^\star - d^\star .






    Remarks
    1. Observe

      \begin{aligned} p^\star = -\infty &\iff \text{ primal unbounded}\\ p^\star = +\infty & \iff \text{ primal infeasible}\\ d^\star = + \infty &\iff \text{ dual unbounded}\\ d^\star = -\infty &\iff \text{ dual infeasible}. \end{aligned}

    2. Therefore

      \begin{aligned}  p^\star = - \infty &\implies d^\star = -\infty \implies \text{ dual is infeasible}\\ d^\star = +\infty &\implies p^\star = +\infty \implies \text{ primal is infeasible}. \end{aligned}

    3. 0<p^\star-d^\star<\infty is possible.
    4. Primal and dual may be simultaneously infeasible, i.e.,

      \begin{aligned} p^\star &= + \infty\\ d^\star &=-\infty \end{aligned}

      may occur.
    5. Convex optimization problems often have strong duality, but not always.















    Slater’s Condition Consider the (COP)

    \text{(COP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}&f_i(x) \leq 0, \quad i=1,\ldots,m\\ &Ax = b \end{cases}

    with domain D.
    Let \text{relint}\,D denote the relative interior of D .
    Intuitively: x\in \text{relint}\,D means x \in D is not on the relative “boundary” of D .
    Recall: D may have dimension smaller than the ambient space \mathbb{R}^n .



    Slater’s condition: there exists x \in \text{relint}\,D such that

    f_i(x)<0, \quad i=1,\ldots,m and Ax = b .

    Strict feasibility: a feasible x satisfying Slater’s condition is called strictly feasible.



    Example Consider the inequality constraints

    \begin{cases} &(x_1-1)^2 + x_2^2 \leq 1\\ &(x_1-2)^2 + x_2^2 \leq 4\\ &(x_1-3)^2+x_2^2 \leq 9 \end{cases}

    and suppose f_0 has

    \text{dom}\,f_0 = \{ (x_1-3)^2+x_2^2 \leq 9 \} .

    Then (x_1,x_2)=(0,0) is feasible but not strictly feasible.
    Moreover, any point in the interior of the smallest disk satisfies Slater’s condition.





    Slater’s Theorem. If (COP) satisfies Slater’s condition, then it is strongly dual and the dual problem is solvable.

    Remarks.
    1. Slater’s condition is a constraint qualification for convex optimization problems.
    2. In principle, an (OP) may be strongly dual without the dual being solvable.
      Thus, Slater’s theorem has the strength of implying there is a dual feasible (\lambda^\star,\nu^\star) with g(\lambda^\star,\nu^\star) = d^\star = p^\star .






    Theorem. If a weakened Slater condition holds (strict inequality required only for the non-affine inequality constraints), then the conclusions of Slater’s theorem hold.

    Remarks.
    1. Thus, if f_1,\ldots,f_k are affine and f_{k+1},\ldots,f_{m} are not, then it suffices that the weakened Slater condition holds: there exists x \in \text{relint}\,D such that

      \begin{aligned} f_i(x) &\leq 0, \quad i =1,\ldots,k\\  f_i(x)&<0,\quad i=k+1,\ldots,m\\  Ax&=b  \end{aligned} .

    2. Therefore, Slater’s condition is really a constraint qualification for the non-affine inequality constraints.















    Remark on Slater’s Condition Consider the convex optimization problem

    \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0 , \quad i=1,\ldots,m\\ & Ax = b \end{cases}.

    Let D \subset \mathbb{R}^n be the domain.
    Since D is convex, it lies in its affine hull \text{aff}\,(D) .
    Given x\in \mathbb{R}^n and r>0 , let

    B(x,r) = \{ y \in \mathbb{R}^n : |x-y|<r \}

    denote the Euclidean ball of radius r and center x .
    Relative interior: the set

    \begin{aligned}  \text{relint}(D) = \{ x \in D: \exists r>0 \text{ such that } B(x,r) \cap \text{aff}\,(D) \subset D \}. \end{aligned}





    Example. In the image below:
    • D is the ellipse lying in the xy -plane.
    • The affine hull \text{aff}\,(D) is the xy -plane.
    • The ball depicts a ball B centered at a point in D and with small enough radius so that B \cap \text{aff}\,(D) still lies in D .
    • The relative interior \text{relint}(D) is the shaded region of the ellipse excluding the curve bounding the domain.







    Question How can Slater’s condition fail?
    I.e., what if there exist no x \in \text{relint}(D) such that

    \begin{aligned} f_i(x) &< 0, \quad i=1,\ldots,m\\ Ax&=b? \end{aligned}







    Consider the (COP)

    \text{(COP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_1(x) \leq 0 \\ & Ax = b \end{cases}.

    where f_0,f_1: \mathbb{R}^3 \to \mathbb{R} are convex.
    Suppose
    • f_1 \leq 0 describes a cube:

      \begin{aligned}  \{f_1(x_1,x_2,x_3) \leq 0 \} = \{0 \leq x_1, x_2, x_3 \leq 1 \} = : C,  \end{aligned}

    • \text{dom}\,f_0 = \text{dom}\,f_1 = \{ x_3 \leq 1 \} ,
    • the solution set

      \{Ax = b \} = \{ x_3 = 1 \}

      is a plane intersecting the top face of the cube.






    The images below depict the cube, the plane and their intersection (a square).
    N.B.: the domain is the area below and including the plane.







    Then the domain is D = \{x_3 \leq 1 \} and (COP) fails Slater’s condition:

    if x \in \text{relint}(D) = \{x_3<1\} , then Ax = b must fail.


    Fix: Can project problem onto a lower dimensional face of C .
    Indeed, since

    C \cap \{ Ax=b\}=\{0\leq x_1,x_2 \leq 1, x_3 = 1 \} ,

    (COP) is equivalent to the problem

    \text{(COP1)} \begin{cases} \text{minimize} & F_0(x_1,x_2)\\ \text{subject to} & -x_1 \leq 0\\ &x_1 -1 \leq 0\\ &-x_2 \leq 0\\ &x_2 -1 \leq 0  \end{cases},

    where F_0(x_1,x_2) = f_0(x_1,x_2,1) .
    Taking x_1=x_2=\frac{1}{2} , we see (COP1) satisfies Slater’s condition.





    Comparing Duals and KKT Conditions.
    Let L_0 and L_1 be the Lagrangian for (COP) and (COP1) respectively.
    Then

    \begin{aligned} L_0(x,\lambda,\nu) &= f_0(x) + \lambda f_1(x) + \nu(x_3-1)\\ x &= \begin{bmatrix} x_1 &x_2&x_3 \end{bmatrix}^T\\ L_1(x',\lambda') &= F_0(x_1,x_2) -\lambda_1 x_1 + \lambda_2(x_1-1) - \lambda_3 x_2 + \lambda_4 (x_2-1)\\ x' &= \begin{bmatrix} x_1 &x_2 \end{bmatrix}^T\\ \lambda' &= \begin{bmatrix} \lambda_1 & \lambda_2 & \lambda_3 & \lambda_4 \end{bmatrix}^T. \end{aligned}

    The respective KKT conditions are

    \begin{cases} \begin{aligned} f_1(x) & \leq 0\\ x_3-1&=0\\ \lambda f_1(x) &=0 \\ \lambda &\geq 0\\ \nabla f_0 + \lambda \nabla f_1 + \begin{bmatrix}0\\0\\\nu\end{bmatrix} &=0 \end{aligned} \end{cases} \quad \begin{cases} \begin{aligned} 0 \leq x_1 &\leq 1\\ 0 \leq x_2 &\leq 1\\ \lambda_1 x_1 = \lambda_2 (x_1-1)&=0\\ \lambda_3 x_2 = \lambda_4 (x_2-1)&=0\\ \lambda_1,\lambda_2,\lambda_3,\lambda_4 &\geq 0\\ \nabla F_0 + \begin{bmatrix} \lambda_2-\lambda_1\\\lambda_4 - \lambda_3 \end{bmatrix} &=0 \end{aligned} \end{cases}







    Remarks.
    1. While (COP) does not satisfy Slater’s condition, its projection (COP1) does.
    2. f_1 might not even be differentiable, in which case the KKT conditions for (COP) would be ill-posed.
    3. Identifying the correct face to project the problem onto \implies relatively simpler KKT conditions.
      (Not true in general.)






    Geometric description of Slater’s condition failing:
    • Consider a general (COP) with convex domain D .
    • Suppose the relative boundary of D contains a convex set K (e.g., K is a polygonal face, or D is conic).
    • If \text{aff}(K) = \{Ax=b\} or \{Ax=b\} \cap D = K , then the problem fails Slater’s condition.
    • Indeed, any feasible point must satisfy Ax=b and lie on the relative boundary (and hence not in the relative interior).
    • N.B.: a problem even “almost” failing Slater’s condition can cause numerical issues.
      E.g., a 3-dimensional problem with nearly 2-dimensional domain.















    Examples Example 1.
    Consider the least squares problem

    \text{(LS)} \begin{cases} \text{minimize}&x^Tx\\ \text{subject to}& Ax = b \end{cases}.

    Recall: the Lagrangian is

    L(x,\nu) = x^Tx + \nu^T(Ax-b)

    and Lagrange dual function is

    g(\nu) = -\frac{1}{4}\nu^TAA^T\nu - \nu^Tb .

    Therefore, the dual problem is

    \begin{cases} \text{maximize}& -\frac{1}{4}\nu^TAA^T\nu - \nu^Tb \end{cases}.







    N.B.: (LS) has no inequality constraints and D = \mathbb{R}^n .
    Thus, Slater’s condition is simply:

    there exists x \in \mathbb{R}^n such that Ax=b, i.e., b \in \text{range}\,A.

    We will analyze this more closely.





    Case 1 (b \in \text{range}\,A ):
    Here, Slater’s condition is satisfied since b \in \text{range}\,A implies there exists x with Ax=b.
    Therefore, primal is feasible and hence p^\star < +\infty .
    Slater’s theorem implies

    d^\star = p^\star < +\infty .

    In particular, the dual objective

    -\frac{1}{4}\nu^TAA^T\nu - \nu^Tb

    is bounded above.





    Case 2 (b \notin \text{range}\,A ):
    Here, the primal is infeasible and so p^\star = +\infty .
    Note b \notin\text{range}\,A \implies there exists z such that A^Tz = 0 and b^Tz \neq 0 .
    (Recall: \text{ker}\,A^T \perp \text{range}\,A.)
    But then

    g(tz)=-\frac{1}{4}t^2z^TAA^Tz - t\,b^Tz = -t\,b^Tz ,

    using A^Tz=0 ; this is unbounded above as a function of t , and so d^\star=+\infty .






    In conclusion: for (LS), there holds d^\star = p^\star , even when p^\star = \infty .
    Therefore, (LS) is strongly dual whether feasible or infeasible.
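    In the feasible case, strong duality can also be verified explicitly: maximizing the concave dual objective gives \nu^\star = -2(AA^T)^{-1}b and g(\nu^\star) = b^T(AA^T)^{-1}b = p^\star . A minimal numerical sketch (assuming numpy and A of full row rank):

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((3, 6))            # full row rank, so b is in range(A)
    b = rng.standard_normal(3)

    p_star = b @ np.linalg.solve(A @ A.T, b)   # x*^T x* at x* = A^T (A A^T)^{-1} b
    nu_star = -2 * np.linalg.solve(A @ A.T, b) # maximizer of the dual objective
    d_star = -0.25 * nu_star @ (A @ A.T) @ nu_star - nu_star @ b
    print(p_star, d_star)                      # equal: strong duality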







    Example 2.
    Consider the standard form (LP)

    \text{(LP)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax = b\\ & x \succeq 0 \end{cases}.

    We have already shown that its dual problem is

    \begin{cases} \text{minimize}& b^T\nu \\ \text{subject to}&  - A^T\nu \preceq c \end{cases}.

    Since the inequality constraints of (LP) are affine, namely,

    x_i \geq 0

    the weakened version of Slater’s condition implies the problem is strongly dual whenever it is feasible.
    Interestingly: (LP) may fail to be strongly dual when infeasible, i.e., there may hold p^\star = +\infty and d^\star = -\infty.







    Example 3.
    Consider the QCQP

    \text{(QCQP)} \begin{cases} \text{minimize} & \frac{1}{2}x^TQ_0x + q_0^Tx + r_0\\ \text{subject to}& \frac{1}{2}x^TQ_ix + q_i^Tx + r_i \leq 0, \quad i=1,\ldots,m \end{cases}

    where

    Q_0 \in \boldsymbol{S}_{++}^n, \quad Q_i \in \boldsymbol{S}_+^n,\quad i=1, \ldots, m.

    We now determine the dual problem.





    The Lagrangian of (QCQP) is

    \begin{aligned} L(x,\lambda) &= \frac{1}{2}x^TQ_0x + q_0^Tx + r_0\\ &+\sum_{i=1}^m \frac{1}{2}\lambda_i x^TQ_ix + \lambda_i q_i^Tx + \lambda_i r_i\\ &=  \frac{1}{2}x^T\left(Q_0+\sum_{i=1}^m \lambda_i Q_i \right)x \\ &+  \left(q_0^T + \sum_{i=1}^m \lambda_i q_i^T \right)x \\ &+  r_0 + \sum_{i=1}^m \lambda_i r_i  \end{aligned}







    Defining

    \begin{aligned} Q(\lambda) &= Q_0 + \sum_{i=1}^m \lambda_i Q_i\\ q(\lambda) &= q_0 + \sum_{i=1}^m \lambda_i q_i\\ r(\lambda) &= r_0 + \sum_{i=1}^m \lambda_i r_i, \end{aligned}

    we have

    L(x,\lambda) = \frac{1}{2}x^TQ(\lambda)x + q(\lambda)^Tx + r(\lambda) .







    We now compute the Lagrange dual function g(\lambda) for \lambda \succeq 0 .

    To begin, observe: if \lambda \succeq 0 , then

    Q(\lambda) = Q_0 + \sum_{i=1}^m \lambda_i Q_i \succ 0

    due to positive definiteness of Q_0 and positive semidefiniteness of each Q_i .
    So: Q(\lambda) is invertible and

    L(x,\lambda) = \frac{1}{2}x^TQ(\lambda)x + q(\lambda)^Tx + r(\lambda) .

    is convex in x.
    Therefore, g(\lambda) is determined by critical points of L(x,\lambda).





    Compute

    \begin{aligned} \nabla_x L(x,\lambda) &= \nabla_x \left(\frac{1}{2}x^TQ(\lambda)x\right) + \nabla_x\left( q(\lambda)^Tx \right) + \nabla_x r(\lambda)\\ &=Q(\lambda)x + q(\lambda). \end{aligned}

    Thus, x^\star is a minimizer of L(x,\lambda) iff

    \nabla_x L(x^\star,\lambda)=0 \iff x^\star = -Q(\lambda)^{-1}q(\lambda) .





    Therefore,

    \begin{aligned} g(\lambda) &= \inf\{L(x,\lambda): x \in \mathbb{R}^n\}\\ &=\min\{L(x,\lambda):x \in \mathbb{R}^n \}\\ &=L(x^\star,\lambda)\\ &=\frac{1}{2}(-Q(\lambda)^{-1}q(\lambda))^TQ(\lambda)(-Q(\lambda)^{-1}q(\lambda)) \\ &+ q(\lambda)^T(-Q(\lambda)^{-1}q(\lambda)) + r(\lambda)\\ &=-\frac{1}{2}q(\lambda)^TQ(\lambda)^{-1}q(\lambda) + r(\lambda). \end{aligned}





    Therefore, the primal problem

    \text{(QCQP)} \begin{cases} \text{minimize} & \frac{1}{2}x^TQ_0x + q_0^Tx + r_0\\ \text{subject to}& \frac{1}{2}x^TQ_ix + q_i^Tx + r_i \leq 0, \quad i=1,\ldots,m \end{cases}

    has dual problem

    \begin{cases} \text{maximize}&-\frac{1}{2}q(\lambda)^TQ(\lambda)^{-1}q(\lambda) + r(\lambda)\\ \text{subject to} & \lambda \succeq0 \end{cases}.

    Slater’s theorem implies that these two problems are strongly dual if there exists a strictly feasible x satisfying

    \frac{1}{2}x^TQ_ix + q_i^Tx + r_i < 0

    for all i =1,\ldots,m.
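    A minimal numerical sketch (assuming numpy): build random data with r_i<0 (so x=0 is strictly feasible) and check weak duality g(\lambda) \leq f_0(x) at the feasible point x=0 .

    import numpy as np

    rng = np.random.default_rng(4)
    n, m = 4, 3

    def rand_psd(shift):                       # random PSD matrix (+ shift * identity)
        B = rng.standard_normal((n, n))
        return B @ B.T + shift * np.eye(n)

    Q0, q0, r0 = rand_psd(1.0), rng.standard_normal(n), 0.0   # Q0 positive definite
    Qs = [rand_psd(0.0) for _ in range(m)]
    qs = [rng.standard_normal(n) for _ in range(m)]
    rs = np.full(m, -1.0)                      # r_i < 0: x = 0 is strictly feasible

    lam = np.abs(rng.standard_normal(m))       # any lambda >= 0
    Q = Q0 + sum(l * Qi for l, Qi in zip(lam, Qs))
    q = q0 + sum(l * qi for l, qi in zip(lam, qs))
    r = r0 + lam @ rs
    g = -0.5 * q @ np.linalg.solve(Q, q) + r   # g(lambda)

    assert g <= r0 + 1e-12                     # weak duality: g(lambda) <= p* <= f_0(0) = r0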














    Qualitative Uses of Lagrange Duality Recall: a problem has

    \begin{aligned} \text{weak duality when }& d^\star \leq p^\star;\\ \text{strong duality when }&d^\star = p^\star. \end{aligned}

    These are qualitative properties; e.g., strong duality alone does not provide a means to find a primal optimal x^\star satisfying f_0(x^\star) = p^\star .






    However, strong and weak duality have three useful “qualitative” applications:

    1. Certification: a dual feasible (\lambda,\nu) provides a certificate that g(\lambda,\nu) is a lower bound on the optimal value: g(\lambda,\nu) \leq p^\star .
      Strong duality \implies can (theoretically) certify up to any desired precision.





      Duality gap: for primal feasible x and dual feasible (\lambda,\nu) , the value

      f_0(x) - g(\lambda,\nu) .

      N.B.: if x,(\lambda,\nu) feasible, then

      g(\lambda,\nu) \leq d^\star \leq p^\star \leq f_0(x),

      i.e.,

      p^\star,d^\star \in [g(\lambda,\nu),f_0(x)]

      and the duality gap gives the length of this interval.



      In particular: if duality gap = 0 at a feasible pair x,(\lambda,\nu) , then

      \begin{aligned} g(\lambda,\nu) \leq d^\star &\leq p^\star \leq f_0(x)\\  g(\lambda,\nu) &= f_0(x)  \end{aligned}

      give

      p^\star = f_0(x) = g(\lambda,\nu) = d^\star ,

      i.e., such (\lambda,\nu) certifies that x is optimal, and vice versa.





    2. Stopping Criterion. Observing

      \begin{aligned}  g(\lambda,\nu) \leq p^\star & \iff  -p^\star \leq - g(\lambda,\nu)\\ &\iff f_0(x) - p^\star \leq f_0(x) - g(\lambda,\nu), \end{aligned}

      and setting

      \begin{aligned} \epsilon =  f_0(x) - g(\lambda,\nu)  \end{aligned}

      we see

      \begin{aligned} (\lambda,\nu) \text{ is dual feasible} \ \implies \text{primal feasible }x \text{ is }\epsilon\text{-suboptimal} \end{aligned}.

      Viz.,

      f_0(x) - p^\star \leq \epsilon .

      N.B.: this is showing x is \epsilon -suboptimal without even knowing the primal optimal p^\star .



      As application: suppose we wish to find optimal x^\star but can settle for feasible x' with f_0(x') at worst \epsilon -suboptimal:

      f_0(x') - p^\star \leq \epsilon .

      Suppose we use algorithm producing feasible x^{(k)},(\lambda^{(k)},\nu^{(k)}) in search of optimal x^\star .
      Letting

      \epsilon_k =f_0(x^{(k)}) - g(\lambda^{(k)},\nu^{(k)})

      denote the resulting duality gaps, we may use the following stopping criterion:

      \text{If } \epsilon_k \leq \epsilon \text{, then stop search} .

      Therefore, K with \epsilon_K \leq \epsilon gives feasible x'=x^{(K)} within the allowed error: f_0(x') - p^\star \leq \epsilon .



      N.B.:
      1. Re-emphasize: this stopping criterion does not require knowing the primal optimal p^\star in advance.
      2. strong duality \implies \epsilon can be arbitrarily small.






    3. Complementary slackness. Assume problem has strong duality, i.e., p^\star = d^\star .
      If x^\star is primal optimal and (\lambda^\star,\nu^\star) is dual optimal, then

      \lambda_i^\star f_i(x^\star) = 0, \quad i=1,\ldots,m.

      For example:

      \begin{aligned} \begin{bmatrix} f^T(x^\star)\\ \hline \lambda^{\star T} \end{bmatrix} = \begin{bmatrix} f_1(x^\star) & 0 & 0 & f_4(x^\star) & \cdots & 0 & f_{m-1}(x^\star) & 0\\ \hline 0 & \lambda_2^\star & \lambda_3^\star & 0 & \cdots & \lambda_{m-2}^\star & 0 & \lambda_m^\star \end{bmatrix}, \end{aligned}

      where f is the vector of inequality constraint functions.
      This relationship between the two vectors is called complementary slackness.
      N.B.:
      1. Since \lambda^\star \succeq 0 and f_i(x^\star)\leq0 , we have

        \begin{aligned} \lambda_i^\star > 0 & \implies f_i(x^\star) = 0\\ f_i(x^\star) < 0 & \implies \lambda_i^\star = 0. \end{aligned}

      2. While a qualitative property, complementary slackness can sometimes be used to solve the primal.
      3. Having \lambda_i^\star = f_i(x^\star)= 0 is permissible.






      Justification: for x^\star primal optimal and (\lambda^\star,\nu^\star) dual optimal, we find

      \begin{aligned} f_0(x^\star) &= g(\lambda^\star,\nu^\star)\\ &= \inf\{ f_0(x) + \sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x) : x \in D \}\\ &\leq f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star f_i(x^\star) + \sum_{i=1}^p \nu_i^\star h_i(x^\star)\\ &\leq f_0(x^\star). \end{aligned}

      Since

      a\leq b \leq c \leq a \implies a=b=c ,

      and since h_i(x^\star) = 0, we conclude

      \begin{aligned}  f_0(x^\star) = f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star f_i(x^\star)  \end{aligned}

      and so

      \begin{aligned}  \sum_{i=1}^m \lambda_i^\star f_i(x^\star) = 0. \end{aligned}







      Since \lambda^\star \succeq0 and f_i(x^\star) \leq 0 , the sum

      \begin{aligned}  \sum_{i=1}^m \lambda_i^\star f_i(x^\star)  \end{aligned}

      is a sum of nonpositive things and so

      \begin{aligned}  \sum_{i=1}^m \lambda_i^\star f_i(x^\star) = 0 \end{aligned}

      implies

      \begin{aligned} \lambda_i^\star f_i(x^\star) = 0, \quad i=1,\ldots,m, \end{aligned}

      which is the desired complementary slackness.















    Karush-Kuhn-Tucker (KKT) Conditions Assume f_i,h_i are differentiable and have open domains.

    In the previous section, we saw: if x^\star,(\lambda^\star,\nu^\star) are optimal with zero duality gap, then

    \begin{aligned} \inf\{ f_0(x) + &\sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x) : x \in D \} \\ &= f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star f_i(x^\star) + \sum_{i=1}^p \nu_i^\star h_i(x^\star). \end{aligned}



    Question: What does this say about relationship between L(x,\lambda,\nu), x^\star and (\lambda^\star,\nu^\star) (under strong duality)?





    The above is equivalent to

    \inf\{L(x,\lambda^\star,\nu^\star):x \in D\} = L(x^\star,\lambda^\star,\nu^\star) ,

    or

    x^\star \in \text{argmin}\{L(x,\lambda^\star,\nu^\star): x \in D \} .

    Viz., if
    • problem is strongly dual,
    • x^\star is primal optimal, and
    • (\lambda^\star,\nu^\star) is dual optimal,
    then x^\star minimizes the Lagrangian L(\cdot,\lambda^\star,\nu^\star) with dual optimal Lagrange multipliers.





    But, if
    • x^\star minimizes L(\cdot,\lambda^\star,\nu^\star) and
    • f_i,h_i are differentiable,
    then x \mapsto L(x,\lambda^\star,\nu^\star) is differentiable and x^\star is a critical point:

    \begin{aligned} \nabla_xL(x^\star,\lambda^\star,\nu^\star) &= \nabla f_0(x^\star) + \sum_{i=1}^m\lambda^\star_i \nabla f_i(x^\star) + \sum_{i=1}^p \nu^\star_i \nabla h_i(x^\star)\\ &=0 \end{aligned}

    Beware: we do not know a priori if \nabla f_0(x^\star) or \nabla h_i (x^\star) are zero.





    Karush-Kuhn-Tucker (KKT) Optimality Conditions:

    \begin{aligned} f_i(x^\star) & \leq 0 , \quad i=1,\ldots,m\\ h_i(x^\star) & = 0 , \quad i=1,\ldots,p\\ \lambda_i^\star f_i(x^\star) &= 0, \quad i =1,\ldots,m\\ \lambda^\star &\succeq 0\\ \nabla_x L(x^\star,\lambda^\star,\nu^\star) &=0 \end{aligned}

    Recap:
    • these are necessary conditions for any strongly dual optimization problem admitting primal optimal x^\star and dual optimal (\lambda^\star,\nu^\star);
    • the first and second conditions just indicate x^\star is primal feasible;
    • the third condition is the complementary slackness derived in previous section;
    • the fourth condition is standard nonnegativity of Lagrange multiplier \lambda;
    • the last condition follows from x^\star minimizing L(\cdot,\lambda^\star,\nu^\star).















    KKT and Convexity Theorem. If the primal problem is differentiable and convex, then the KKT conditions are sufficient for primal and dual optimality and strong duality.

    Viz., for differentiable convex problems, if x' and (\lambda',\nu') satisfy the KKT conditions, then they are automatically primal and dual optimal, respectively.
    Proof. Step 1. Suppose x' and (\lambda',\nu') satisfy the KKT conditions:

    \begin{aligned} f_i(x') & \leq 0 , \quad i=1,\ldots,m\\ h_i(x') & = 0 , \quad i=1,\ldots,p\\ \lambda_i' f_i(x') &= 0, \quad i =1,\ldots,m\\ \lambda' &\succeq 0\\ \nabla_x L(x',\lambda',\nu') &=0 \end{aligned}.

    First two conditions \implies x' is primal feasible.





    Step 2. Observe

    \lambda' \succeq 0 \implies \lambda'_if_i(x) are convex.

    Thus

    \begin{aligned} L(x,\lambda',\nu') = f_0(x) + \sum_{i=1}^m \lambda_i' f_i(x) + \sum_{i=1}^p \nu_i' h_i(x) \end{aligned}

    as a function of x is a sum of convex functions and hence convex.

    Therefore \nabla_x L(x',\lambda',\nu') = 0 \implies x' is a minimizer.
    This also implies (\lambda',\nu') is dual feasible since

    \begin{aligned} g(\lambda',\nu') &= \inf\{ L(x,\lambda',\nu'): x \in D \} \\ &= L(x',\lambda',\nu')\\ &> -\infty  \end{aligned} .







    Step 3. By Step 2., feasibility of x' and the complementary slackness

    \lambda_i' f_i(x') = 0, \quad i =1,\ldots,m,

    there holds

    \begin{aligned} g(\lambda',\nu') &= L(x',\lambda',\nu')\\ &= f_0(x') + \sum_{i=1}^m\lambda'_i f_i(x') + \sum_{i=1}^p \nu'_i h_i(x')\\ &=f_0(x'). \end{aligned}

    Therefore, the duality gap f_0(x') - g(\lambda',\nu') vanishes and hence x' is primal optimal and (\lambda',\nu') is dual optimal.

    Indeed, recall:

    g(\lambda',\nu') \leq d^\star \leq p^\star \leq f_0(x')

    and so g(\lambda',\nu')=f_0(x') implies

    g(\lambda',\nu') = d^\star = p^\star = f_0(x').









    Corollary. If the primal problem is differentiable, convex and satisfies Slater’s condition, then the KKT conditions are necessary and sufficient for primal and dual optimality and strong duality.

    Viz.: in this situation, finding all solutions to KKT conditions provides all solutions to the given problem.





    Remark. In convex optimization, many algorithms are conceived as methods for solving KKT conditions.
    Moreover, the KKT conditions for some problems may be solved analytically.














    Example 1 (CO Example 5.1)
    Consider the quadratic program

    \text{(QP)} \begin{cases} \text{minimize} &  \frac{1}{2}x^TQx + q^Tx + r\\ \text{subject to} & Ax = b \end{cases}

    with Q \in \boldsymbol{S}_+^n .

    Goal: Derive the KKT conditions for (QP) and solve (QP).





    Step 0. Observe that (QP) is a differentiable convex problem with no inequality constraints, so Slater’s condition reduces to feasibility of Ax = b ; assuming feasibility, the KKT conditions may be used to solve it.






    Step 1. Find the Lagrangian and its gradient:
    (QP) only has the equality constraint Ax = b and so the Lagrangian is

    L(x,\nu) = \frac{1}{2}x^TQx + q^Tx + r + \nu^T(Ax-b) .

    Differentiating with respect to x gives

    \begin{aligned} \nabla_x L(x,\nu) &= \nabla_x(\frac{1}{2}x^TQx) + \nabla_x(q^Tx) + \nabla_x r + \nabla_x(\nu^TAx - \nu^T b) \\ &= Qx + q +A^T\nu. \end{aligned}







    Step 2. Construct the KKT conditions:
    Since (QP) has no inequality constraints, the KKT conditions take the form

    \begin{aligned} h_i(x^\star) &= 0, \quad i= 1,\ldots,p\\ \nabla_x L(x^\star,\nu^\star) &= 0. \end{aligned}

    Viz., the KKT conditions for (QP) are

    \begin{aligned} Ax^\star &= b \\ Qx^\star + q + A^T\nu^\star&=0. \end{aligned}







    Rewriting the KKT conditions as

    \begin{aligned} Qx^\star + A^T\nu^\star&=-q\\ Ax^\star &= b, \end{aligned}

    the KKT conditions are evidently equivalent to the matrix equation

    \begin{bmatrix} Q & A^T\\ A & 0 \end{bmatrix} \begin{bmatrix} x^\star\\ \nu^\star \end{bmatrix} = \begin{bmatrix} -q\\ b \end{bmatrix}.







    Conclusion: Solving the quadratic program

    \begin{cases} \text{minimize} &  \frac{1}{2}x^TQx + q^Tx + r\\ \text{subject to} & Ax = b \end{cases}

    is equivalent to solving the linear equation

    \begin{bmatrix} Q & A^T\\ A & 0 \end{bmatrix} \begin{bmatrix} x^\star\\ \nu^\star \end{bmatrix} = \begin{bmatrix} -q\\ b \end{bmatrix}.
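    A minimal numerical sketch (assuming numpy): assemble and solve the KKT system directly.

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 5, 2
    B = rng.standard_normal((n, n))
    Q = B @ B.T + np.eye(n)                    # Q positive definite
    q = rng.standard_normal(n)
    A = rng.standard_normal((p, n))            # full row rank, so K is nonsingular
    b = rng.standard_normal(p)

    K = np.block([[Q, A.T], [A, np.zeros((p, p))]])
    sol = np.linalg.solve(K, np.concatenate([-q, b]))
    x_star, nu_star = sol[:n], sol[n:]

    print(np.allclose(A @ x_star, b))                      # primal feasibility
    print(np.allclose(Q @ x_star + q + A.T @ nu_star, 0))  # stationarity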
















    Example 2 (CO Example 5.2)
    Consider the convex optimization problem

    \text{(WF)} \begin{cases} \text{minimize} &  -\sum_{i=1}^n \log(\alpha_i + x_i)\\ \text{subject to} & \boldsymbol{1}^Tx  = 1\\ & x \succeq 0 \end{cases}

    where \alpha_i > 0 and \boldsymbol{1} \in \mathbb{R}^n is the vector of 1’s.

    Goal: Derive the KKT conditions for (WF) and solve (WF).





    Step 0. Observe
    • the domain of (WF) contains the nonnegative orthant x \succeq 0 .
    • (WF) is a differentiable convex problem.
    • The condition x \succeq 0 is equivalent to -x \preceq 0 .
    • (WF) satisfies Slater’s condition since there exists x \succ 0 with \boldsymbol{1}^Tx = 1 , and so (WF) satisfies strong duality.
    • Therefore, we may use KKT conditions to solve (WF).






    Step 1. Find the Lagrangian and its gradient:
    Given the constraints

    \begin{cases}&\boldsymbol{1}^Tx  = 1\\ & -x \preceq 0 \end{cases}

    the Lagrangian is

    \begin{aligned}  L(x,\lambda,\nu) = -\sum_{i=1}^n \log(\alpha_i + x_i) - \lambda^T x + \nu(\boldsymbol{1}^Tx - 1) \end{aligned}

    with Lagrange multipliers

    \lambda \in \mathbb{R}^n, \quad \nu \in \mathbb{R} .

    Differentiating with respect to x gives

    \begin{aligned}  \nabla_x L(x,\lambda,\nu) &= -\sum_{i=1}^n \nabla_x \log (\alpha_i + x_i) - \nabla_x(\lambda^T x) +\nu \nabla_x(\boldsymbol{1}^Tx)\\ &= -\begin{bmatrix} \frac{1}{\alpha_1 + x_1} & \cdots & \frac{1}{\alpha_n + x_n} \end{bmatrix}^T - \lambda + \nu \boldsymbol{1}. \end{aligned}

    Thus, x^\star is a critical point of L(\cdot,\lambda,\nu) iff it satisfies the system of equations

    -\frac{1}{\alpha_i + x_i} - \lambda_i + \nu = 0, \quad i = 1,\ldots, n .







    Step 2. Construct the KKT conditions:
    Since (WF) has inequality and equality constraints, the KKT conditions take the form

    \begin{aligned} f_i(x^\star) & \leq 0 , \quad i=1,\ldots,m\\ h_i(x^\star) & = 0 , \quad i=1,\ldots,p\\ \lambda_i^\star f_i(x^\star) &= 0, \quad i =1,\ldots,m\\ \lambda^\star &\succeq 0\\ \nabla_x L(x^\star,\lambda^\star,\nu^\star) &=0. \end{aligned}

    Thus, the KKT conditions for (WF) are

    \begin{aligned} x^\star &\succeq 0 \\ \boldsymbol{1}^Tx^\star & = 1\\ \lambda_i^\star x_i^\star &= 0, \quad i =1,\ldots,n\\ \lambda^\star &\succeq 0\\ -\frac{1}{\alpha_i + x_i^\star} - \lambda_i^\star + \nu^\star &= 0, \quad i = 1,\ldots, n . \end{aligned}







    Observe:

    \begin{aligned} \lambda_i^\star x_i^\star &= 0, \quad i =1,\ldots,n\\ \lambda^\star &\succeq 0\\ -\frac{1}{\alpha_i + x_i^\star} - \lambda_i^\star + \nu^\star &= 0, \quad i = 1,\ldots, n . \end{aligned}

    is equivalent to

    \begin{aligned}  \left(\nu^\star - \frac{1}{\alpha_i + x_i^\star} \right) x_i^\star &= 0, \quad i =1,\ldots,n\\ \nu^\star &\geq \frac{1}{\alpha_i + x_i^\star}, \quad i = 1,\ldots, n . \end{aligned}

    (In particular, \lambda^\star is acting as a slack variable.)





    Therefore, we wish to solve:

    \begin{aligned} x^\star &\succeq 0 \\ \boldsymbol{1}^Tx^\star & = 1\\ \left(\nu^\star - \frac{1}{\alpha_i + x_i^\star} \right) x_i^\star &= 0, \quad i =1,\ldots,n\\ \nu^\star &\geq \frac{1}{\alpha_i + x_i^\star}, \quad i = 1,\ldots, n . \end{aligned}

    We will solve for x_i^\star in terms of \nu^\star by considering two cases.





    Case 1: \nu^\star < \frac{1}{\alpha_i} .
    Observe

    \frac{1}{\alpha_i + x_i^\star} \leq \nu^\star < \frac{1}{\alpha_i}

    is only possible for x^\star\succeq 0 if x_i^\star >0 .
    Then the complementary slackness

    \left(\nu^\star - \frac{1}{\alpha_i + x_i^\star} \right) x_i^\star = 0

    enforces \nu^\star = \frac{1}{\alpha_i + x_i^\star} , i.e., x_i^\star = \frac{1}{\nu^\star} - \alpha_i .





    Case 2: \nu^\star \geq \frac{1}{\alpha_i}
    If x_i^\star >0 , then

    \nu^\star - \frac{1}{\alpha_i + x_i^\star}>0

    and so the complementary slackness

    \left(\nu^\star - \frac{1}{\alpha_i + x_i^\star} \right) x_i^\star = 0

    furnishes the contradiction x_i^\star = 0 .
    Thus \nu^\star \geq \frac{1}{\alpha_i} \implies x_i^\star = 0 .





    Putting the two cases together:

    \begin{aligned} x_i^\star &=  \begin{cases} \frac{1}{\nu^\star} - \alpha_i & \nu^\star < \frac{1}{\alpha_i}\\ 0 & \nu^\star \geq \frac{1}{\alpha_i} \end{cases}\\ &= \max\{ 0 ,\frac{1}{\nu^\star} - \alpha_i \}. \end{aligned}





    Next, using \boldsymbol{1}^Tx^\star = 1 , we get

    \begin{aligned} \sum_{i=1}^n \max\{0,\frac{1}{\nu^\star} - \alpha_i\} =1. \end{aligned}

    This is enough to solve for \nu^\star and hence x^\star .





    Further details: Consider the function

    G(t) = \sum_{i=1}^n \max\{0,t-\alpha_i\}

    with 0<\alpha_1 < \alpha_2 < \cdots < \alpha_n .
    Observe

    \begin{aligned} \text{on }[0,\alpha_1] \text{ there holds }& G(t)=0\\ \text{on }[\alpha_1,\alpha_2] \text{ there holds }& G(t) = t-\alpha_1\\ \text{on }[\alpha_2,\alpha_3] \text{ there holds }& G(t) = t-\alpha_1 + t-\alpha_2 = 2t - \alpha_1 - \alpha_2 \end{aligned}

    and so on.
    Moreover, G is continuous.
    Thus G(t) is an increasing continuous piecewise linear function.
    Then G(t) = 1 may be solved by finding when the graph of G(t) crosses the horizontal line y = 1 .
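    A minimal numerical sketch (assuming numpy): solve G(t)=1 by bisection (with t = 1/\nu^\star ) and recover x^\star .

    import numpy as np

    alpha = np.array([0.2, 0.5, 1.0, 2.0])
    G = lambda t: np.maximum(0.0, t - alpha).sum()

    lo, hi = 0.0, alpha.max() + 1.0            # G(lo) = 0 <= 1 <= G(hi)
    for _ in range(60):                        # bisection: G is continuous and increasing
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if G(mid) < 1.0 else (lo, mid)

    t = 0.5 * (lo + hi)                        # t = 1/nu*
    x = np.maximum(0.0, t - alpha)             # x_i* = max{0, 1/nu* - alpha_i}
    print(x, x.sum())                          # x* >= 0 and sums to 1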














    Perturbation Given the OP:

    \text{(OP)} \begin{cases} \text{minimize}& f_0(x)\\ \text{subject to} & f_i(x) \leq 0, \quad i=1,\ldots,m\\ & h_i(x) = 0, \quad i=1,\ldots,p, \end{cases}

    a natural question is:

    How do p^\star and x^\star behave under perturbations of constraints?

    More precisely: given u \in \mathbb{R}^m , v \in \mathbb{R}^p , consider the perturbed problem

    \text{(OP)}_{uv} \begin{cases} \text{minimize}& f_0(x)\\ \text{subject to} & f_i(x) \leq u_i, \quad i=1,\ldots,m\\ & h_i(x) = v_i, \quad i=1,\ldots,p, \end{cases}

    Observe
    • (u,v)=(0,0) results in \text{(OP)}_{uv} = \text{(OP)}_{00} = \text{(OP)} ;
    • u_i>0 results in relaxing f_i(x) \leq 0 ;
    • u_i<0 results in tightening f_i(x) \leq 0 ;
    • v_i \neq 0 results in “translating” solution set of h_i(x)=0.






    Example 1.
    The image below depicts various perturbations in inequality constraints (the shaded regions) and equality constraints (the dashed contours).
    N.B.: perturbing the equality constraint results in using different contours.






    Example 2.
    The three images below depict the constraint

    x^2+y^2 - 1 \leq u

    with u=-0.5,0,0.5, respectively.
    The next three images depict the constraint

    x+y-1=v

    with v=-0.5,0,0.5, respectively.






    The optimality function
    For u \in \mathbb{R}^m and v \in \mathbb{R}^p , let p^\star(u,v) denote the primal optimal value for \text{(OP)}_{uv} .
    Can therefore introduce the function

    \begin{aligned} p^\star(\cdot,\cdot) : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R} \end{aligned}

    which assigns perturbation (u,v) the primal optimal value p^\star(u,v) . N.B.:
    • p^\star(0,0)=p^\star= primal optimal value of unperturbed problem;
    • If p^\star(u,v)=+\infty , then the perturbation (u,v) makes the problem infeasible.
    • If p^\star(u,v)=-\infty , then the perturbation (u,v) makes the problem unbounded.
    • If (OP) is convex, then p^\star(u,v) is convex in (u,v) .






    Example
    Consider the problem

    \begin{cases} \text{minimize} &  f_0(x) = -\sqrt{x}\\ \text{subject to} & x-1 \leq 0 \end{cases}.

    Given \text{dom}\,f_0 = \mathbb{R}_+ and the constraint, problem is:

    minimize -\sqrt{x} on [0,1].

    Evidently, x^\star = 1,  p^\star = -1 .





    Consider now the perturbation

    \begin{cases} \text{minimize} &  f_0(x) = -\sqrt{x}\\ \text{subject to} & x-1 \leq u \end{cases}.

    Viz.,

    minimize -\sqrt{x} on [0,1+u].

    Given \text{dom}\,f_0 = \mathbb{R}_+ , perturbed problem is feasible only for u \geq -1 .
    Can thus conclude

    p^\star(u) =  \begin{cases} -\sqrt{1+u} & u \geq -1\\ +\infty & u<-1 \end{cases}.







    The graph of p^\star(u) is plotted below; observe the behavior of p^\star(u) as the constraint is relaxed.















    Sensitivity Theorem. If the primal has strong duality and the dual optimal d^\star is achieved by dual feasible (\lambda^\star,\nu^\star), then

    p^\star(0,0)-  \lambda^{\star T}u - \nu^{\star T}v \leq p^\star(u,v)

    for any u \in \mathbb{R}^m and v \in \mathbb{R}^p .


    Proof. Fix the perturbation vector (u,v) and let x be feasible for the resulting perturbed problem:

    \begin{aligned} f_i(x) &\leq u_i, \quad i=1,\ldots,m\\ h_i(x) &= v_i, \quad i=1,\ldots,p. \end{aligned}







    Observe:

    \begin{aligned} \begin{aligned} f_i(x) &\leq u_i\\ \lambda^\star & \succeq 0 \end{aligned} &\implies \lambda_i^\star f_i(x) \leq \lambda_i^\star u_i \\ &\implies \sum_{i=1}^m \lambda_i^\star f_i(x) \leq \lambda^{\star T}u\\ h_i(x) = v_i &\implies \sum_{i=1}^p \nu_i^\star h_i(x) = \nu^{\star T} v. \end{aligned}







    Using this and strong duality gives

    \begin{aligned} p^\star(0,0)&=g(\lambda^\star,\nu^\star)\\ &\leq f_0(x) + \sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x)\\ &\leq f_0(x) + \lambda^{\star T}u + \nu^{\star T}v. \end{aligned}

    Rearranging gives

    \begin{aligned} p^\star(0,0)-  \lambda^{\star T}u - \nu^{\star T}v \leq f_0(x) \end{aligned}

    for all x feasible for the perturbed problem.
    Since LHS independent of x , there holds

    \begin{aligned} p^\star(0,0)-  \lambda^{\star T}u - \nu^{\star T}v \leq p^\star(u,v). \end{aligned}
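    For instance, applied to the earlier example (minimize -\sqrt{x} subject to x-1 \leq u ): stationarity -\frac{1}{2\sqrt{x}} + \lambda = 0 at x^\star = 1 gives \lambda^\star = \frac{1}{2} , so the theorem predicts

    p^\star(u) \geq p^\star(0) - \lambda^\star u = -1 - \frac{u}{2} ,

    and indeed -\sqrt{1+u} \geq -1-\frac{u}{2} for u \geq -1 , since \left(1+\frac{u}{2}\right)^2 = 1 + u + \frac{u^2}{4} \geq 1+u .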










    Remark. Using this theorem, the sizes and signs of \lambda_i^\star,\nu_i^\star indicate how sensitive the primal optimal value is to perturbations of the constraints.





    Example Suppose m=p=1 , so the Theorem’s inequality takes the form

    p^\star(u,v) \geq p^\star(0,0) - \lambda^\star u - \nu^\star v, \quad \lambda^\star,\nu^\star,u,v\in\mathbb{R}.

    We make four observations.
    1. The larger \lambda^\star is, the more tightening f_1(x)\leq0 (i.e., taking u<0 ) forces p^\star(u,v) to increase.
      Consider \lambda^\star = 100 :

      \begin{aligned} f_1(x) \leq -0.01 &\implies p^\star(-0.01,0) \geq p^\star(0,0) + 1 \\ f_1(x) \leq -0.1 &\implies p^\star(-0.1,0) \geq p^\star(0,0) + 10  \\ f_1(x) \leq -1 &\implies p^\star(-1,0) \geq p^\star(0,0) + 100  \\ \end{aligned}








    2. The smaller \lambda^\star is, the more flexibility we have to relax f_1(x) \leq 0 without decreasing p^\star(u,v) too much.
      Consider \lambda^\star = 0.01 :

      \begin{aligned} f_1(x) \leq 1 &\implies p^\star(1,0) \geq p^\star(0,0) - 0.01 \\ f_1(x) \leq 10 &\implies p^\star(10,0) \geq p^\star(0,0) - 0.1  \\ f_1(x) \leq 100 &\implies p^\star(100,0) \geq p^\star(0,0) - 1 \\ \end{aligned}








    3. When \nu^\star v <0 : the larger |\nu^\star| is, the more changing v forces p^\star(u,v) to increase.
      Consider \nu^\star = \pm 100 :

      \begin{aligned} h_1(x) = \mp 0.01 &\implies p^\star(0,\mp 0.01) \geq p^\star(0,0)  + 1 \\ h_1(x) = \mp 0.1 &\implies p^\star(0,\mp 0.1) \geq p^\star(0,0)  + 10 \\ h_1(x) = \mp 1 &\implies p^\star(0,\mp 1) \geq p^\star(0,0)  + 100 \\ \end{aligned}








    4. When \nu^\star v > 0 : the smaller |\nu^\star| is, the more flexibility we have to change v without decreasing p^\star(u,v) too much.
      Consider \nu^\star = \pm 0.01:

      \begin{aligned} h_1(x) = \pm 1 &\implies p^\star(0,\pm 1) \geq p^\star(0,0)  - 0.01 \\ h_1(x) = \pm 10 &\implies p^\star(0,\pm 10) \geq p^\star(0,0)  - 0.1 \\ h_1(x) = \pm 100 &\implies p^\star(0,\pm 100) \geq p^\star(0,0)  - 1 \\ \end{aligned}
















    Example This example reviews KKT conditions, Slater’s condition, perturbation and sensitivity.
    Consider the problem

    \text{(P)} \begin{cases} \text{minimize} & \frac{1}{2}x - \frac{1}{2}y + 1\\ \text{subject to}& \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 \leq 0 \\ & x+y = R \end{cases},

    where R>0 is fixed. (N.B.: despite the linear objective, (P) is not an LP, since the inequality constraint is quadratic.)





    The Lagrangian and its gradient
    Since the constraints are

    \begin{cases} & \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 \leq 0 \\ & x+y = R \end{cases},

    the Lagrangian takes the form

    \begin{aligned} L(x,y,\lambda,\nu) &= \frac{1}{2}x - \frac{1}{2}y + 1 \\ &+ \frac{\lambda}{R^2}x^2 + \frac{\lambda}{R^2}y^2 - \lambda \\ &+ \nu x + \nu y - \nu R. \end{aligned}

    Differentiating with respect to (x,y) gives

    \nabla L =  \begin{bmatrix} \frac{1}{2} + \frac{2\lambda}{R^2}x + \nu\\ -\frac{1}{2} + \frac{2\lambda}{R^2}y + \nu\\ \end{bmatrix}.







    KKT Conditions
    Since (P) has both inequality and equality constraints, its KKT conditions take the form

    \begin{aligned} f_i(x^\star) & \leq 0 , \quad i=1,\ldots, m\\ h_i(x^\star) & = 0 , \quad i=1,\ldots,p\\ \lambda_i^\star f_i(x^\star) &= 0, \quad i =1,\ldots,m\\ \lambda^\star &\succeq 0\\ \nabla L(x^\star,\lambda^\star,\nu^\star) &=0. \end{aligned}

    Therefore, we wish to solve

    \begin{aligned} \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 &\leq 0\\ x+y-R &=0\\ \lambda\left(\frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1\right) &= 0\\ \lambda &\geq 0\\ \frac{1}{2} + \frac{2\lambda}{R^2}x + \nu &= 0\\ -\frac{1}{2} + \frac{2\lambda}{R^2}y + \nu &=0 \end{aligned}







    Observe:

    \begin{aligned} \lambda &=0\\ \nabla L &= 0 \end{aligned} \implies \begin{aligned} \frac{1}{2} + \nu &= 0\\ -\frac{1}{2} + \nu &= 0 \end{aligned},

    which is impossible and so \lambda > 0 .

    Using complementary slackness and \lambda > 0 gives

    \begin{aligned} \lambda\left(\frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1\right) = 0 \implies \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 =0. \end{aligned}







    Therefore, x and y must solve

    \begin{aligned} x+y-R &=0\\ \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 &=0. \end{aligned}

    Equivalently,

    \begin{aligned} \frac{1}{R^2}(R-y)^2 + \frac{1}{R^2}y^2 - 1 =0, \end{aligned}

    which has solutions y=0,R , and whence x=R,0, respectively.





    Using \nabla L = 0 and (x,y)=(R,0) gives

    \begin{aligned} \frac{1}{2} + \frac{2\lambda}{R^2}x + \nu &= 0\\ -\frac{1}{2} + \frac{2\lambda}{R^2}y + \nu &=0 \end{aligned} \implies \begin{aligned} \frac{1}{2} + \frac{2\lambda}{R} + \nu &= 0\\ -\frac{1}{2} + \nu &=0, \end{aligned}

    which has no solution for \lambda>0 .
    Therefore, (x,y)=(R,0) is not optimal.





    Using \nabla L = 0 and (x,y)=(0,R) gives

    \begin{aligned} \frac{1}{2} + \frac{2\lambda}{R^2}x + \nu &= 0\\ -\frac{1}{2} + \frac{2\lambda}{R^2}y + \nu &=0 \end{aligned} \implies \begin{aligned} \frac{1}{2}  + \nu &= 0\\ -\frac{1}{2} + \frac{2\lambda}{R} +  \nu &=0, \end{aligned}

    which has solution (\lambda^\star,\nu^\star)=(\frac{R}{2},-\frac{1}{2}).





    Conclusion.

    \begin{aligned} \text{primal optimal point} &= (x^\star,y^\star)=(0,R)\\ \text{dual optimal point} &= (\lambda^\star,\nu^\star)=(\frac{R}{2},-\frac{1}{2})\\ \text{primal optimal value} &= p^\star\\ &=\frac{1}{2}\cdot 0 - \frac{1}{2}R + 1\\ & = -\frac{R}{2}+1 \end{aligned}







    Sensitivity
    Recall the sensitivity inequality:

    p^\star(0,0)-  \lambda^{\star T}u - \nu^{\star T}v \leq p^\star(u,v)

    This inequality applies to (P) since it is strongly dual (Slater’s condition holds: e.g., (x,y)=(\tfrac{R}{2},\tfrac{R}{2}) is strictly feasible) and d^\star is achieved.
    Thus

    -\frac{R}{2} + 1 -  \frac{R}{2}u + \frac{1}{2}v \leq p^\star(u,v)

    where p^\star(u,v) is the primal optimal for the perturbed problem

    \begin{cases} \text{minimize} & \frac{1}{2}x - \frac{1}{2}y + 1\\ \text{subject to}& \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 \leq u \\ & x+y - R=v \end{cases}.







    Remarks.
    • By our previous analysis: if R is large, then the problem ought to be sensitive to making u more and more negative.





    • In hindsight, this is obvious: the inequality

      \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 \leq u

      is equivalent to

      x^2 + y^2 \leq R^2 + uR^2

      Therefore, perturbing by making u more and more negative considerably restricts the problem to a smaller and smaller disk.
      In fact, the problem is no longer feasible for any v when u < -1 !






    • Explicitly: it is straightforward to compute (a short derivation follows this list)

      p^\star(u,0) =-\frac{R}{2}\sqrt{2u+1} +1 .

      Compare

      p^\star(0,0) =-\frac{R}{2}+1 \ll p^\star(u\sim -\frac{1}{2},0) \sim 1 .
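    (Derivation sketch: on the line x+y=R , substitute x = \frac{R+s}{2} , y = \frac{R-s}{2} with s = x-y ; the constraint becomes \frac{R^2+s^2}{2} \leq R^2(1+u) , i.e., s^2 \leq R^2(2u+1) , and minimizing the objective \frac{s}{2}+1 gives s^\star = -R\sqrt{2u+1} .)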







    Question: What if we had formulated the problem with the constraint

    x^2 + y^2 - R^2 \leq 0

    instead of

    \frac{1}{R^2}x^2 + \frac{1}{R^2}y^2 - 1 \leq 0 ?

    Why is this problem no longer sensitive to small negative perturbations u ?














    Geometry of Lagrangian Duality Goal: Provide a geometric description of Lagrange duality.

    Restriction: We consider only the setting of one inequality constraint and no equality constraints; cf. CO Section 5.3 for general dimensions.





    One Constraint Setting
    Consider the OP

    \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_1(x) \leq 0 \end{cases}

    with x \in \mathbb{R}^n and domain D .
    Construct the vector function

    F(x) =  \begin{bmatrix} f_1(x)\\ f_0(x) \end{bmatrix}.

    Consider the sets

    \begin{aligned} \mathcal{G} &= \{(u,t) : u=f_1(x),t=f_0(x)\text{ for some }x \in D \}\\ &= F(D)\\ \mathcal{G}_{\text{feas}} &= \{(u,t) \in \mathcal{G}: u \leq 0\}\\ &= \{ (u,t) : u=f_1(x),t=f_0(x) \text{ for some feasible }x\} \end{aligned}

    Thus (u,t) \in \mathcal{G}_{\text{feas}} iff

    there exists x \in D with f_0(x)=t and u=f_1(x)\leq0.







    Let

    \begin{aligned} t^\star &= \inf\{t : \exists u\leq0 \text{ with } (u,t) \in \mathcal{G} \} \\ &=\inf\{t:(u,t) \in \mathcal{G}_{\text{feas}}\}. \end{aligned}

    Intuitively: t^\star is the smallest t such that (u,t) \in \mathcal{G}_{\text{feas}} for some u \leq 0 .
    Thus t^\star is the smallest value f_0(x) can take among feasible x .
    Can therefore conclude: p^\star = t^\star .
    Explicitly:

    \begin{aligned} t^\star &= \inf\{t : \exists u\leq0 \text{ with } (u,t) \in \mathcal{G} \}\\ &= \inf\{t : \exists x \in D \text{ with } f_1(x) \leq 0 , f_0(x)=t \}\\ &= \inf\{f_0(x) : x \text { feasible}\}\\ &=p^\star. \end{aligned}







    Examples
    1. Consider the problem

      \begin{cases} \text{minimize} & \frac{1}{2}s \cos(s)\\ \text{subject to}& 3\log(s-1) - 3 \leq 0 \end{cases}.

      Define

      F(s) =  \begin{bmatrix} 3\log(s-1) - 3\\ \frac{1}{2}s \cos(s) \end{bmatrix}.

      Then F(s) describes a parametric curve in \mathbb{R}^2 .





      The image \mathcal{G} of F(s) is plotted below:

      Question: Which point (A, B, C, or another) corresponds to the optimal value p^\star ?

      Answer. N.B.: \mathcal{G}_{\text{feas}} corresponds to the portion of \mathcal{G} with u \leq 0 .
      This portion of \mathcal{G} is the dashed line in the graph below.
      Observed above: p^\star corresponds to smallest value the t -coordinate “can take” for (u,t) \in \mathcal{G}_{\text{feas}} .
      Consequently, B corresponds to the point (u,p^\star) .






    2. Consider a problem whose \mathcal{G} is given by the curve and its enclosed region as depicted below:
      Question: Which point (A, B, C, or another) corresponds to the optimal value p^\star ?

      Answer. p^\star = 3 .
      This is the t -coordinate for the points A and B.
      N.B.: both A and B belong to \mathcal{G}_{\text{feas}} .
      Remark: C is not considered because C is not in \mathcal{G}_{\text{feas}}.
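    Example 1 above is also simple enough to check numerically. A minimal sketch (assuming numpy; the grid approach is ours): the constraint 3\log(s-1) - 3 \leq 0 says exactly that 1 < s \leq 1 + e , so we minimize the objective over a grid of that interval.

    import numpy as np

    # Numeric estimate of p* for Example 1: minimize (1/2) s cos(s) over (1, 1+e].
    s = np.linspace(1 + 1e-9, 1 + np.e, 1_000_001)
    f0 = 0.5 * s * np.cos(s)
    i = np.argmin(f0)
    print(s[i], f0[i])   # prints the minimizer and p*; p* is the t-coordinate of B above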






    The Lagrange dual function
    For each \lambda \in \mathbb{R} , define the function

    \begin{aligned} \Gamma_\lambda&:\mathcal{G} \to \mathbb{R}\\ \Gamma_\lambda(u,t) &= \begin{bmatrix} \lambda & 1 \end{bmatrix}\begin{bmatrix}u \\ t \end{bmatrix} = \lambda u + t. \end{aligned}

    But (u,t) \in \mathcal{G} iff u=f_1(x) and t=f_0(x) for some x \in D and so

    \begin{aligned} \Gamma_\lambda(u,t) = \lambda u + t = \lambda f_1(x) + f_0(x). \end{aligned}

    Question: Have we seen this before?
    Answer. \lambda f_1(x) + f_0(x) is exactly the Lagrangian L(x,\lambda) !






    Since (this is an equality of sets of real numbers)

    \begin{aligned} \{ \Gamma_\lambda(u,t) : (u,t) \in \mathcal{G} \} &= \{ \lambda u + t : (u,t) \in \mathcal{G} \}\\ &= \{ L(x,\lambda) : x \in D \} \end{aligned}

    we conclude

    \begin{aligned} g(\lambda) &= \inf \{ L(x,\lambda) : x \in D \}\\ &= \inf \{ \Gamma_\lambda(u,t) : (u,t) \in \mathcal{G} \} \\ &= \inf \{ \lambda u + t: (u,t) \in \mathcal{G} \} . \end{aligned}

    In particular:

    g(\lambda) \leq \lambda u + t

    for all (u,t) \in \mathcal{G} ; i.e.,

    \{ (u,t) \in \mathbb{R}^2 : \lambda u + t = g(\lambda) \}

    is a supporting hyperplane of \mathcal{G} .
    N.B.: g(\lambda) is the t -intercept of this line.
    (This is all only meaningful if g(\lambda) is finite.)





    Weak duality revisited
    Observe:

    \lambda \geq 0 , \quad u \leq 0 \implies \lambda u \leq 0

    and so

    \lambda u + t \leq t.

    Using g(\lambda) \leq \lambda u + t for (u,t) \in \mathcal{G}_{\text{feas}} gives

    g(\lambda) \leq t.

    Since this holds for all t with (u,t) \in \mathcal{G}_{\text{feas}} , we conclude

    g(\lambda) \leq p^\star.

    Since this holds for all \lambda\geq0 , we conclude weak duality:

    d^\star \leq p^\star.







    Example 1.
    Consider a problem whose \mathcal{G} is given by the curve and its enclosed region as depicted below:






    The image below depicts
    • the optimal value p^\star ;
    • the line g(\frac{2}{3}) = \frac{2}{3}u + t ;
    • and the value g(\frac{2}{3}) = 2 given as the t-intercept of this line.






    The image below depicts the line g(\frac{4}{3}) = \frac{4}{3}u+t :






    Remark. Observe that no supporting hyperplane of \mathcal{G} can intersect (0,p^\star).
    Thus, there are no multipliers \lambda^\star such that g(\lambda^\star)= p^\star .
    As a result, this problem is not strongly dual.
    This is further indicated in the image below.






    Example 2.
    Consider a problem whose \mathcal{G} is given by the curve and its enclosed region as depicted below:
    Question: Does this problem satisfy strong duality?
    Answer. Yes!
    As the image below depicts, observe that there is a supporting hyperplane passing through (0,p^\star).
    This is enough to conclude d^\star = p^\star and hence strong duality.















    Sketch of Proof of Slater’s Theorem Recall:

    Slater’s Theorem. If a convex optimization problem satisfies Slater’s condition, then it is strongly dual and the dual problem is solvable.



    As in the previous section, we consider problems with one inequality constraint:

    \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_1(x) \leq 0 \end{cases}

    where x \in \mathbb{R}^n and both f_0,f_1 are convex.

    Define the epigraph

    \begin{aligned} \mathcal{A} = \{ (u,t) : f_1(x) \leq u, f_0(x)\leq t\text{ for some }x \in D \}. \end{aligned}

    Thus, if

    \xi = (f_1(x),f_0(x)) \in \mathcal{G},

    then every point “above and to the right” of \xi is in \mathcal{A} .





    Example. In the images below, the sets \mathcal{G} and \mathcal{A} are given.
    Important remarks:
    • The point (0,p^\star) is generically on the boundary of \mathcal{A}.
    • Strong duality is prevented exactly because \mathcal{A} is not convex.






    Sketch of Proof of Slater’s Theorem.
    Observe: f_0,f_1 convex \implies \mathcal{A} convex.
    Indeed, if (u,t),(u',t') \in \mathcal{A} and s \in [0,1], then
    • f_1(x) \leq u, f_1(x') \leq u' gives

      \begin{aligned}f_1(sx+(1-s)x') &\leq s f_1(x) + (1-s)f_1(x')\\ & \leq su+ (1-s)u' \end{aligned}

    • f_0(x)\leq t, f_0(x')\leq t' gives

      \begin{aligned}f_0(sx+(1-s)x') &\leq s f_0(x) + (1-s)f_0(x')\\ & \leq st+ (1-s)t' \end{aligned}

    Therefore s(u,t) + (1-s)(u',t') \in \mathcal{A} .





    \mathcal{A} convex \implies for each boundary point P \in bd\mathcal{A} of \mathcal{A} there is a supporting hyperplane \ell_P of \mathcal{A} which intersects P .
    This is depicted below.






    Recall: P^\star:=(0,p^\star) \in bd\mathcal{A}
    Therefore, there is a supporting hyperplane \ell_{P^\star} of \mathcal{A} which intersects P^\star.
    N.B.: \ell_{P^\star} lies below \mathcal{A}.
    This is depicted below.






    Assume Slater’s condition: there exists x' \in D with f_1(x')<0 .
    Let u'<0 be such that f_1(x') < u'<0 .
    Let t'>f_0(x').
    N.B.: (u',t') lies above and to the right of (f_1(x'),f_0(x')).
    Then (u',t') \in \text{int}(\mathcal{A}) and lies above \mathcal{G}_{\text{feas}}.
    Slater’s condition is what ensures such an interior point exists.
    This is depicted below.






    Since \ell_{P^\star} is a supporting hyperplane below \mathcal{A}, it follows that (u',t') has to lie above \ell_{P^\star}.
    This ensures \ell_{P^\star} is nonvertical and so it has a finite slope \lambda' .
    This is depicted below.






    Since \mathcal{A} extends upward and to the right from its boundary, the supporting hyperplane \ell_{P^\star} must have \lambda' \geq 0 ; thus \lambda' is dual feasible and

    g(\lambda') = \inf\{\lambda' u + t: (u,t) \in \mathcal{A}\} = p^\star.

    Therefore, strong duality holds.
    N.B.: If g(\lambda') = c \neq p^\star were the case, then the line \{ \lambda' u + t = c \} would fail to be a supporting hyperplane since it passes through (0,c) \neq (0,p^\star).
    In particular, the value c=\lambda' u + t could either be decreased further or is unachievable.
    The lines for c=p^\star,a,b with a<p^\star<b are depicted below.















    Theorems of Alternatives Recall the feasibility problem:

    \begin{cases} \text{find} & x\\ \text{subject to} & f_i(x) \leq 0, i = 1, \ldots, m\\ & h_i(x) = 0, i = 1,\ldots, p \end{cases}.

    Goal: Use Lagrange duality to study the feasibility problem.





    Observe that the feasibility problem is equivalent to the minimization problem:

    \text{(FP)} \begin{cases} \text{minimize} & 0\\ \text{subject to} & f_i(x) \leq 0, i = 1, \ldots, m\\ & h_i(x) = 0, i = 1,\ldots, p \end{cases}.

    Indeed, a solution to (FP) exists iff the constraints are consistent.

    The optimal value for (FP) is given by

    p^\star = \begin{cases} 0 & \text{(FP) is feasible}\\ +\infty & \text{(FP) is infeasible} \end{cases}.







    Duality of Feasibility Problem.
    Let

    f(x) = \begin{bmatrix}f_1(x)\\\vdots\\f_m(x)\end{bmatrix},\quad h(x) = \begin{bmatrix} h_1(x) \\\vdots\\h_p(x) \end{bmatrix}.

    Since the objective function of (FP) is f_0(x)=0 , the Lagrangian is

    L(x,\lambda,\nu) = \lambda^T f(x) + \nu^T h(x)

    with Lagrange multipliers (\lambda,\nu) \in \mathbb{R}^m \times \mathbb{R}^p.
    The Lagrange dual function is thus

    g(\lambda,\nu) = \inf\{ \lambda^T f(x) + \nu^T h(x) : x \in D \}.

    The dual of (FP) is therefore

    \begin{cases} \text{maximize} & g(\lambda,\nu)\\ \text{subject to} & \lambda \succeq 0. \end{cases}.







    The dual feasibility problem is thus

    \text{(DFP)} \begin{cases} \text{find} & (\lambda,\nu)\\ \text{subject to} & \lambda \succeq 0\\ & g(\lambda,\nu)>0. \end{cases}.

    Observe for t \geq 0 :

    \begin{aligned} g(t\lambda,t\nu) &= \inf\{ t\lambda^T f(x) + t\nu^T h(x) : x \in D \}\\ &=t\inf\{ \lambda^T f(x) + \nu^T h(x) : x \in D \}\\ &= tg(\lambda,\nu). \end{aligned}

    Using this and that g(0,0)=0, we conclude

    d^\star =  \begin{cases} +\infty & \text{(DFP) is feasible}\\ 0 & \text{(DFP) is infeasible} \end{cases}.

    Justification. Indeed: if \exists (\lambda,\nu) with \lambda \succeq 0 and g(\lambda,\nu)>0 , then

    g(t\lambda,t\nu) = tg(\lambda,\nu)>0

    can be made as large as desired by letting t \to \infty.
    On the other hand, if no such (\lambda,\nu) exists, then the largest g(\lambda,\nu) can be is 0 , attained at (\lambda,\nu)=(0,0).






    Weak Alternatives.
    Recall weak duality asserts d^\star \leq p^\star .
    We also just derived

    p^\star = \begin{cases} 0 & \text{(FP) is feasible}\\ +\infty & \text{(FP) is infeasible} \end{cases}, \quad d^\star =  \begin{cases} +\infty & \text{(DFP) is feasible}\\ 0 & \text{(DFP) is infeasible} \end{cases}.

    Therefore:
    • (FP) feasible \implies p^\star =0 \implies d^\star = 0 \implies (DFP) infeasible.
    • (DFP) feasible \implies d^\star = \infty \implies p^\star = \infty \implies (FP) infeasible.
    In general:

    If at most one of two problems can be feasible at a time, then they are called weak alternatives.

    Therefore, (FP) and (DFP) are weak alternatives.





    Strong Alternatives.
    In general:

    If exactly one of two problems is feasible at a time, then they are called strong alternatives.

    Farkas’ Lemma.
    Let A \in \mathbb{R}^{m \times n} and c \in \mathbb{R}^n .
    Then the feasibility problems

    \begin{cases} \text{find} &x\\ \text{subject to} & Ax \preceq0\\ &c^Tx < 0 \end{cases} \quad \text{and}\quad \begin{cases} \text{find} & y\\ \text{subject to}&A^Ty + c =0\\ &y \succeq 0 \end{cases}

    are strong alternatives.
    Proof. Consider the LP

    \text{(LP)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax \preceq0 \end{cases}.

    Its Lagrangian and Lagrange dual function are

    \begin{aligned} L(x,\lambda) &= c^Tx+ \lambda^TAx = (c+A^T\lambda)^Tx\\ g(\lambda) &= \begin{cases} 0 & c+A^T\lambda =0\\ -\infty & \text{else} \end{cases} \end{aligned}

    Therefore, the dual problem is

    \text{(DLP)} \begin{cases} \text{maximize} & 0\\ \text{subject to} & A^Ty+c=0\\ &y\succeq 0  \end{cases} .

    N.B.: (LP) and (DLP) are strongly dual and so their respective optimal values p^\star,d^\star satisfy p^\star=d^\star.





    Observe:
    • Ax \preceq 0 , c^Tx<0 infeasible \iff p^\star = 0.
    • Ax \preceq 0 , c^Tx<0 feasible \iff p^\star = -\infty.
    • A^Ty+c=0, y\succeq0 infeasible \iff d^\star =-\infty.
    • A^Ty+c=0, y\succeq0 feasible \iff d^\star =0.
    Using strong duality p^\star=d^\star, we conclude that the feasibility problems

    \begin{cases} \text{find} &x\\ \text{subject to} & Ax \preceq0\\ &c^Tx < 0 \end{cases} \quad \text{and}\quad \begin{cases} \text{find} & y\\ \text{subject to}&A^Ty + c =0\\ &y \succeq 0 \end{cases}

    are strong alternatives.
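    A numeric illustration may help. The following sketch (assuming scipy is available; the random instance and the feasibility logic are ours) tests both systems with scipy.optimize.linprog. Since x=0 always satisfies Ax \preceq 0 with c^Tx = 0 , and the feasible set is a cone, the first system is feasible exactly when \min\{c^Tx : Ax \preceq 0\} is unbounded below.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))
    c = rng.standard_normal(3)

    # System 1: Ax <= 0, c^T x < 0.  Feasible iff the LP is unbounded (status 3).
    res1 = linprog(c, A_ub=A, b_ub=np.zeros(4), bounds=[(None, None)] * 3)
    sys1_feasible = (res1.status == 3)

    # System 2: A^T y = -c, y >= 0.  A pure feasibility LP with zero objective.
    res2 = linprog(np.zeros(4), A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 4)
    sys2_feasible = (res2.status == 0)

    print(sys1_feasible, sys2_feasible)   # exactly one should print True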















    Descent Algorithms for Unconstrained Minimization
    Overview Unconstrained Minimization.
    We will focus on unconstrained problems of the form

    \begin{cases} \text{minimize} & f(x). \end{cases}

    The main assumptions on f will be
    • f is strongly convex (defined below).
    • f is twice continuously differentiable: \nabla^2 f is continuous.
    • the initial sublevel set of f is closed (made precise below).
    Constrained minimization problems will come later.

    Goal.
    Formulate and study algorithms which search for a minimizer x^\star that solves the problem: p^\star = f(x^\star) .







    Idea: Using Descent Methods.
    Find iterative rules

    G_k: \text{dom}\,f \to \text{dom}\,f

    so that the sequence

    x^{(k+1)} = G_k(x^{(k)}), \quad k =0, 1, 2, \ldots,

    stabilizes and satisfies descent:

    \begin{aligned} x^{(k)} &\to x' \text{ for some }x' \text{ as }k \to \infty\\ f(x^{(k+1)}) &< f(x^{(k)}) \text{ whenever } x^{(k)} \text{ is not optimal}. \end{aligned}

    N.B.: such rules are natural for searching for minimizers.
    Without convexity, such rules may get “stuck” at local minimizers.
    Example.
    Consider the iterative rule

    G_k(x^{(k)}) = x^{(k)} + h_k \nu^{(k)}

    where h_k \in \mathbb{R} are step sizes and \nu^{(k)} \in \mathbb{R}^n are search direction vectors.
    Generally, h_k,\nu^{(k)} may depend on x^{(k)} .
    Thus G_k(x) determines how far to step and in what direction from x.







    Remarks.
    1. If x^{(k)} \to x^\star , then continuity would give

      f(x^{(k)}) \to f(x^\star) = p^\star.

    2. In general, p^\star need not be known a priori.
    3. In practice, one specifies tolerance \epsilon>0 and terminates search when an iterate x^{(K)} satisfies this tolerance:

      f(x^{(K)})-p^\star \leq \epsilon.

    4. To start the search: a suitable starting point x^{(0)} needs to be chosen.
    5. To stop the search: a suitable stopping criterion that ensures tolerance is met needs to be determined.
    6. Generally G_k depends on the step; if G_k = G is independent of the step, then G is called stationary.








    General Descent Algorithm.
    Given iterative rule G_k satisfying descent, a desired tolerance \epsilon > 0 , and a stopping criterion

    \sigma(x^{(k)}) = \begin{cases} \text{true} & \text{if } f(x^{(k)}) - p^\star \leq \epsilon\\ \text{false} &\text{ else} \end{cases}

    a general descent algorithm takes the form:
    
    given initial x^{(0)} \in \text{dom}\,f .
    repeat: compute x^{(k+1)} = G_k(x^{(k)}).
    until: \sigma(x^{(k)}) = \text{true}.
    
    A natural kind of stopping criterion may be

    \sigma(x^{(k)}) = \begin{cases} \text{true} & \text{if } \Vert \nabla f(x^{(k)})\Vert_2 \text{ is sufficiently small}\\ \text{false} &\text{ else} \end{cases}.

    Indeed, for differentiable convex functions, if \nabla f(x) = 0, then x=x^\star .
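    The general descent algorithm is short enough to transcribe directly. A minimal sketch in Python (assuming numpy; the names descend and G are ours), using the gradient-norm stopping criterion just mentioned:

    import numpy as np

    def descend(G, grad, x0, eps=1e-6, max_iter=10_000):
        # Iterate x^{(k+1)} = G(x^{(k)}) until sigma(x^{(k)}) = true.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            if np.linalg.norm(grad(x)) <= eps:   # stopping criterion sigma
                break
            x = G(x)
        return x

    # Usage on f(x) = ||x||^2 / 2 with the stationary rule G(x) = x - 0.5 grad f(x):
    grad = lambda x: x
    print(descend(lambda x: x - 0.5 * grad(x), grad, np.ones(3)))   # converges to 0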














    Mathematical Framework Main Assumptions
    For theoretical convenience, we always assume
    1. f satisfies strong convexity (defined below)
    2. f is twice continuously differentiable.
    3. for the chosen initial point x^{(0)}, the sublevel set

      S := \{ x \in \text{dom}\, f : f(x) \leq f(x^{(0)}) \}

      is closed.
    N.B.:
    1. Since f(x^\star) = p^\star \leq f(x^{(0)}), we have x^\star \in S.
    2. S is closed whenever \text{dom}\,f = \mathbb{R}^n and f is continuous.
      However, S may fail to be closed in nontrivial and non-pathological cases; e.g., consider the case where \text{dom}\,f is an open ball and x^{(0)} maximizes f over it.
      Then S = \text{dom}\,f is an open ball in \mathbb{R}^n and therefore not closed.








    Strong convexity.
    We say f satisfies strong convexity on S if there exists m >0 such that

    m Id  \preceq \nabla^2 f(x) , \quad \forall x \in S.

    N.B.:
    1. Id \in \mathbb{R}^{n \times n} indicates the identity matrix.
    2. Fix x \in S and let v \in \mathbb{R}^n be an eigenvector of \nabla^2 f(x) with eigenvalue \lambda .
      Then

      \nabla^2 f(x) - m Id \succeq0

      implies

      0 \leq v^T (\nabla^2 f(x) - m Id) v = (\lambda-m)\Vert v\Vert_2^2.

      Thus

      0< m \leq \lambda and \nabla^2f(x) \succ 0.

    3. In particular, strong convexity implies strict convexity! (A numeric way of estimating the constant m is sketched below.)
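    In practice, a strong convexity constant can be estimated by sampling the smallest Hessian eigenvalue. A small sketch (assuming numpy; the toy objective below is our own, with Hessian \text{diag}(e^{x_i}) + 2Id \succeq 2Id everywhere):

    import numpy as np

    # f(x) = exp(x1) + exp(x2) + ||x||^2 has Hessian diag(exp(xi)) + 2 Id,
    # so m = 2 works; the sampled minimum eigenvalue confirms this.
    def hessian(x):
        return np.diag(np.exp(x)) + 2 * np.eye(2)

    samples = np.random.default_rng(1).uniform(-2, 2, size=(1000, 2))
    m_est = min(np.linalg.eigvalsh(hessian(x))[0] for x in samples)
    print(m_est)   # always >= 2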








    Proposition. If f is strongly convex and S is closed, then there exists M >0 such that

    \nabla^2 f(x) \preceq M Id , \quad \forall x \in S.

    Proof.
    1. The plan is to show that the sublevel set S (closed by assumption) is also bounded, hence compact, and use continuity of \nabla^2 f(x) to conclude each of its matrix entries is bounded.
      This is enough to conclude the inequality.







    2. f twice continuously differentiable implies:
      for each x,y \in S there exists z on the line x\to y such that

      \begin{aligned} f(y) = f(x) + \nabla f(x)^T(y-x) + \frac{1}{2}(y-x)^T \nabla^2 f(z)(y-x). \end{aligned}









    3. Strong convexity \nabla^2 f(x) \succeq m Id implies

      \begin{aligned} \frac{1}{2}(y-x)^T \nabla^2 f(z)(y-x) \geq \frac{m}{2}\Vert y-x \Vert_2^2. \end{aligned}

      Taking y \in S and x = x^\star in previous step, there holds

      \begin{aligned} f(x^{(0)})& \geq f(y)\\ & \geq f(x^\star) + \nabla f(x^\star)^T(y-x^\star) + \frac{m}{2}\Vert y-x^\star \Vert_2^2\\ &= p^\star  + \frac{m}{2}\Vert y-x^\star \Vert_2^2. \end{aligned}









    4. Previous step gives

      \begin{aligned} \frac{2}{m}\left( f(x^{(0)})-p^\star \right) \geq \Vert y-x^\star \Vert_2^2, \end{aligned}

      which implies all y \in S belong to a ball of sufficiently large radius with center x^\star , and therefore S is bounded.







    5. Since S is bounded and closed, it is compact.
      Therefore \nabla^2 f is continuous on a compact set.
      Therefore each entry of \nabla^2 f is bounded and hence \nabla^2 f(x) \preceq M Id for sufficiently large M.










    Remark.
    Just as mId \preceq \nabla^2 f(x) gave a lower bound on the eigenvalues of \nabla^2 f(x) , the bound \nabla^2 f(x) \preceq MId provides an upper bound on the eigenvalues.
    The proof is mutatis mutandis the same.







    Proposition. For x \in S there holds

    \frac{1}{2m}\Vert \nabla f(x) \Vert_2^2  \geq f(x) - p^\star \geq \frac{1}{2M}\Vert \nabla f(x) \Vert_2^2.

    Proof.
    1. Observe: if a matrix A \in \mathbb{R}^{n \times n} satisfies

      m Id \preceq A \preceq M Id,

      then for all v \in \mathbb{R}^n there holds

      \begin{aligned} m\Vert v \Vert_2^2 = v^T(m Id)v \leq v^TAv \leq v^T(M Id)v = M\Vert v \Vert_2^2. \end{aligned}



      Therefore, using

      m Id \preceq \nabla^2 f(x) \preceq M Id,

      we have

      \begin{aligned} \frac{m}{2} \Vert y-x \Vert_2^2 \leq \frac{1}{2}(y-x)^T \nabla^2 f(z) (y-x) \leq \frac{M}{2}\Vert y-x \Vert_2^2. \end{aligned}









    2. As above: for each x,y \in S there exists z on the line x\to y such that

      \begin{aligned} f(y) = f(x) + \nabla f(x)^T(y-x) + \frac{1}{2}(y-x)^T \nabla^2 f(z)(y-x). \end{aligned}









    3. Observe that:

      q(y):= f(x) + \nabla f(x)^T (y-x) + \frac{c}{2}\Vert{y-x}\Vert_2^2

      is a convex quadratic for c>0 .
      Moreover,

      \nabla q(y_0) = \nabla f(x) + c(y_0-x) =0

      iff

      y_0 = x - \frac{1}{c} \nabla f(x).

      Therefore

      \begin{aligned} \text{min}\, q(y) &= q(y_0)\\ &=q(x - \frac{1}{c} \nabla f(x))\\ &= f(x) - \frac{1}{2c}\Vert \nabla f(x) \Vert_2^2 \end{aligned}









    4. Using \nabla^2 f(x) \preceq M Id, we have

      \begin{aligned} f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{M}{2}\Vert y-x \Vert_2^2 \end{aligned}

      and minimizing over y gives

      \begin{aligned} p^\star \leq f(x) - \frac{1}{2M}\Vert \nabla f(x) \Vert_2^2. \end{aligned}

      This proves

      \begin{aligned}  \frac{1}{2M}\Vert \nabla f(x) \Vert_2^2 \leq f(x) - p^\star . \end{aligned}









    5. Using m Id \preceq \nabla^2 f(x) we have

      \begin{aligned} f(y) \geq f(x) + \nabla f(x)^T(y-x) + \frac{m}{2}\Vert y-x \Vert_2^2 \end{aligned}

      and minimizing over y gives

      \begin{aligned} p^\star \geq f(x) - \frac{1}{2m}\Vert \nabla f(x) \Vert_2^2. \end{aligned}

      This proves

      \begin{aligned} \frac{1}{2m}\Vert \nabla f(x) \Vert_2^2 \geq f(x) - p^\star . \end{aligned}











    Remark.
    The upper bound provides a stopping criterion: if x^{(K)} satisfies

    \Vert \nabla f(x^{(K)}) \Vert_2 \leq \sqrt{2m\epsilon},

    then

    f(x^{(K)}) - p^\star \leq \epsilon.

    Viz., x^{(K)} satisfies \epsilon-tolerance.
    Yet, p^\star does not even need to be known for this.
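    Both inequalities and the stopping rule can be checked exactly on a quadratic. A sketch (assuming numpy; the instance is ours) with f(x) = \frac{1}{2}x^THx , H = \text{diag}(m,M) , for which p^\star = 0 :

    import numpy as np

    m, M = 0.5, 4.0
    H = np.diag([m, M])
    rng = np.random.default_rng(2)
    for _ in range(5):
        x = rng.standard_normal(2)
        gap = 0.5 * x @ H @ x               # f(x) - p*
        g2 = np.linalg.norm(H @ x) ** 2     # ||grad f(x)||_2^2
        assert g2 / (2 * M) <= gap <= g2 / (2 * m)

    # Stopping rule: ||grad f(x)||_2 <= sqrt(2 m eps) certifies f(x) - p* <= eps.
    eps, x = 1e-6, np.array([1e-4, 1e-4])
    if np.linalg.norm(H @ x) <= np.sqrt(2 * m * eps):
        assert 0.5 * x @ H @ x <= eps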







    Proposition. For x \in S, there holds

    \Vert x^\star - x \Vert_2 \leq \frac{2}{m} \Vert \nabla f(x) \Vert_2.

    Proof. Again, we use

    \begin{aligned} f(y) = f(x) + \nabla f(x)^T(y-x) + \frac{1}{2}(y-x)^T \nabla^2 f(z)(y-x) \end{aligned}

    by taking y = x^\star , which gives

    \begin{aligned} p^\star &= f(x^\star) \\ &= f(x) + \nabla f(x)^T(x^\star-x) + \frac{1}{2}(x^\star-x)^T \nabla^2 f(z)(x^\star-x)\\ &\geq f(x) - \Vert \nabla f(x)\Vert_2 \Vert x^\star - x\Vert_2 + \frac{m}{2} \Vert x^\star - x \Vert_2^2. \end{aligned}

    Here we used Cauchy-Schwarz to conclude

    \begin{aligned}  \Vert \nabla f(x)\Vert_2 \Vert x^\star -x \Vert_2 \geq \nabla f(x)^T(x-x^\star) \end{aligned}

    and so

    \begin{aligned}  -\Vert \nabla f(x)\Vert_2 \Vert x^\star -x \Vert_2 \leq \nabla f(x)^T(x^\star - x). \end{aligned}









    Since p^\star - f(x) \leq 0 , we conclude

    \begin{aligned} 0 \geq - \Vert \nabla f(x)\Vert_2 \Vert x^\star - x\Vert_2 + \frac{m}{2} \Vert x^\star - x \Vert_2^2 \end{aligned}

    and whence

    \begin{aligned}  \frac{2}{m}\Vert \nabla f(x) \Vert_2 \geq \Vert x^\star - x \Vert_2. \end{aligned}



    Corollary. If x^\star,x^{\star\star} are two minimizers, then x^\star = x^{\star\star} since

    \Vert x^\star - x^{\star\star} \Vert_2 \leq \frac{2}{m} \Vert \nabla f(x^{\star\star}) \Vert_2 = 0.
















    General Descent Methods Plan: Further specialize and describe general descent methods.







    Minimizing sequence: a sequence \{ x^{(k)}\} \subset \text{dom}\,f such that

    x^{(k)} \to x^\star as k \to \infty .

    Goal: construct a minimizing sequence \{ x^{(k)}\} satisfying
    • each iterate x^{(k+1)} is defined via

      x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}

      where

      \begin{aligned} t^{(k)}\geq0 & \text{ is called the step size}\\ \Delta x^{(k)} \in \mathbb{R}^n & \text{ is called the search direction}. \end{aligned}

    • the t^{(k)} and \Delta x^{(k)} are chosen so that the sequence satisfies descent:

      f(x^{(k+1)}) < f(x^{(k)}) whenever x^{(k)} \neq x^\star .









    Remarks.
    1. \Delta x^{(k)} is generally not assumed to be a unit vector;
    2. A minimizing sequence satisfying descent satisfies

      x^{(k)} \in S := \{ x : f(x) \leq f(x^{(0)} ) \} .

    3. Customary to write x:= x+ t \Delta x or x^+ = x + t\Delta x as shorthand for x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)} .








    Necessary condition for descent: since f is convex and differentiable, there holds

    f(x) + \nabla f(x)^T(y-x) \leq f(y),

    and so

    \nabla f(x)^T(y-x) \geq 0 \implies f(x) \leq f(y).



    Therefore, for x:= x+t\Delta x to satisfy descent, it is necessary that

    \nabla f(x^{(k)})^T \Delta x^{(k)} <0.

    Indeed, by the contrapositive of the implication above, descent f(x^{(k+1)}) < f(x^{(k)}) forces \nabla f(x^{(k)})^T (x^{(k+1)} - x^{(k)}) < 0 ; dividing by t^{(k)}>0 and using

    \Delta x^{(k)} = \frac{1}{t^{(k)}} (x^{(k+1)} - x^{(k)} )

    gives the claim.

    Descent direction: any search direction \Delta x satisfying

    -\nabla f(x)^T \Delta x > 0.

    Viz., \Delta x and - \nabla f(x) form an acute angle.








    General descent stopping criterion.
    Recall: strong convexity m Id \preceq \nabla^2 f(x) ensures

    \Vert \nabla f(x) \Vert_2 \leq \sqrt{2m\epsilon} \implies f(x)-p^\star \leq \epsilon .

    Since it is generally impossible to know the strong convexity constant m , one settles for choosing \epsilon'>0 sufficiently small so that

    \Vert \nabla f(x) \Vert_2 \leq \epsilon' \implies f(x)-p^\star \leq \epsilon

    is likely to hold.
    Stopping criteria for the descent methods studied here are often of this form.








    General descent algorithm: Using the iterative rule x := x + t \Delta x, a general descent algorithm takes the following form.
    
    given initial x \in \text{dom}\,f 
    repeat:
    1. Determine descent direction \Delta x .
    2. Choose step size t>0.
    3. Take step: x:= x + t\Delta x .
    until: stopping criterion holds.
    















    Line Searching Observe

    \{ x + t \Delta x : t \geq 0 \}

    is a ray emanating from x in the direction \Delta x.
    Thus Step 2. in the general descent algorithm is to determine where to step onto this line from x .
    Step 2. is therefore called a line search.







    Exact line search.
    Let t_{\text{exact}} minimize f along the line \{ x + t \Delta x: t \geq 0 \} .
    (Such t_{\text{exact}} exists under our standing assumptions on f .)
    Certainly f(x+t_{\text{exact}}\Delta x) \leq f(x) .
    This search for t_{\text{exact}} is called an exact line search.
    A general descent algorithm with exact line search is recorded below.
    
    given initial x \in \text{dom}\,f 
    repeat:
    1. Determine descent direction \Delta x .
    2. Compute t_{\text{exact}} = \text{argmin}\{f(x+t\Delta x): t \geq 0 \} .
    3. Take step: x:= x + t_{\text{exact}}\Delta x .
    until: stopping criterion holds.
    








    Remarks.
    1. Let t^{(k)}_{\text{exact}} be the sequence of exact step sizes and let

      x^{(k+1)} = x^{(k)} + t_{\text{exact}}^{(k)}\Delta x^{(k)}

      be the resulting sequence of iterates.
      Since a descent direction is used, each t_{\text{exact}}^{(k)}>0 and so the sequence x^{(k)} satisfies descent:

      f(x^{(k+1)}) < f(x^{(k)}).

    2. Using exact search is only reasonable when its computational cost is considerably less than the computational cost of finding search directions \Delta x .
      Otherwise, resources can be better spent finding better search directions.
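    For one-dimensional minimization along the ray, an off-the-shelf scalar minimizer suffices. A sketch of one exact-line-search step (assuming scipy; the helper exact_step and the crude bound t_max are ours), reproducing one of the line searches in the examples below:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def exact_step(f, x, dx, t_max=100.0):
        # Minimize t -> f(x + t dx) over [0, t_max] and take the step.
        phi = lambda t: f(x + t * dx)
        t_exact = minimize_scalar(phi, bounds=(0.0, t_max), method='bounded').x
        return x + t_exact * dx, t_exact

    f = lambda z: 0.5 * (z[0] - z[1]) ** 2 + z[1]
    x1, t1 = exact_step(f, np.array([0.0, 1.5]), np.array([1.0, 0.0]))
    print(t1, x1)   # t_exact = 1.5, landing at (1.5, 1.5)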








    Examples
    Example 1.
    Consider the objective

    f(x,y) = \frac{1}{2}(x-y)^2 + y .

    The following image depicts a portion of the graph of f.








    The next figure depicts the restrictions of f to the lines

    \begin{aligned} &\{ (0,0.5) + t (1,0): t \geq 0 \}\\ &\{ (0,1) + t (1,0): t \geq 0 \}\\ &\{ (0,1.5) + t (1,0): t \geq 0 \} \end{aligned}









    The last image depicts only these restrictions together with the corresponding minimizers x + t_{\text{exact}} \Delta x obtained from exact line search on each respective line.








    Example 2.
    Consider the same objective

    f(x,y) = \frac{1}{2}(x-y)^2 + y .

    The first image below depicts the first step of a general descent method using exact line search where

    \begin{aligned} x^{(0)} &= (0,1.5)\\ \Delta x^{(0)} &= (1,0)\\ t_{\text{exact}}^{(0)} &= 1.5\\ x^{(1)} &= x^{(0)} + t_{\text{exact}}^{(0)} \Delta x^{(0)} = (1.5,1.5). \end{aligned}









    The next image below depicts the second step using exact line search where

    \begin{aligned} x^{(1)} &= (1.5,1.5)\\ \Delta x^{(1)} &= (0,-1)\\ t_{\text{exact}}^{(1)} &= 1\\ x^{(2)} &= x^{(1)} + t_{\text{exact}}^{(1)} \Delta x^{(1)} = (1.5,.5). \end{aligned}









    The last image below depicts the third step using exact line search where

    \begin{aligned} x^{(2)} &= (1.5,.5)\\ \Delta x^{(2)} &= (-1,0)\\ t_{\text{exact}}^{(2)} &= 1\\ x^{(3)} &= x^{(2)} + t_{\text{exact}}^{(2)} \Delta x^{(2)} = (.5,.5). \end{aligned}
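    These three steps are easy to reproduce numerically; a sketch (assuming scipy, reusing the exact_step helper sketched earlier):

    import numpy as np
    from scipy.optimize import minimize_scalar

    f = lambda z: 0.5 * (z[0] - z[1]) ** 2 + z[1]

    def exact_step(x, dx):
        t = minimize_scalar(lambda t: f(x + t * dx),
                            bounds=(0.0, 100.0), method='bounded').x
        return x + t * dx, t

    x = np.array([0.0, 1.5])
    for dx in ([1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]):
        x, t = exact_step(x, np.array(dx))
        print(t, x)   # t = 1.5, 1, 1; iterates (1.5,1.5), (1.5,0.5), (0.5,0.5)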









    Backtracking line search.
    Naturally: exact line search may be too computationally expensive.

    Therefore, we may settle for a line search which either
    • decreases the objective f enough or
    • approximately minimizes f in the direction \Delta x.


    Idea: given descent direction \Delta x and parameter \beta \in (0,1),
    1. take step x \mapsto x + t \Delta x
    2. “backtrack”: test smaller steps x \mapsto x + \beta^k t\Delta x until the decrease

      f(x+\beta^k t\Delta x) - f(x)

      behaves suitably at each iteration for convergence to hold.
    N.B.: Even if the initial step x \mapsto x+t\Delta x results in an increase in the objective, convexity ensures x \mapsto x+\beta^k t\Delta x results in a decrease for some k.







    Motivation of backtracking line search:
    Throughout, fix
    • f:\mathbb{R}\to\mathbb{R}
    • descent direction \Delta x, i.e., \Delta x \in \mathbb{R} and f'(x)\Delta x < 0
    • Parameters \alpha,\beta \in (0,1).








    Observations:
    1. For t small, Taylor’s approximation gives

      f(x+t\Delta x) - f(x) \approx tf'(x) \Delta x.

      In particular, small steps guarantee an approximate decrease by the amount tf'(x) \Delta x.








    2. At worst, a general step size t makes

      x\mapsto x+t\Delta x

      either overshoot the minimizer

      x + t_{\text{exact}} \Delta x

      or is too small for

      x + t\Delta x

      to be “near”

      x+t_{\text{exact}}\Delta x .









    3. Consider the linear extrapolation

      y(t) = tf'(x) \Delta x + f(x)

      as a function of t.
      Since y(0) = f(x) , the difference y(t) - f(x) is a linear approximation of the difference f(x+t\Delta x) - f(x) .
      The line searching stopping criterion:
      
      until: f(x+t\Delta x) - f(x) \leq  \alpha tf'(x) \Delta x. 
      
      amounts to searching for t until f(x) decreases by a fraction of what the linear approximation gives.








    4. By 1. and f'(x)\Delta x < 0 , the inequality in 3. is guaranteed for small t .
      Indeed

      f(x+t\Delta x) - f(x) \approx tf'(x) \Delta x \leq  \alpha tf'(x) \Delta x.









    5. By 4., we observe that, if

      f(x+t\Delta x) - f(x) \leq  \alpha tf'(x) \Delta x.

      is not satisfied after the first step x\mapsto x + t \Delta x , then

      f(x+\beta^k t\Delta x) - f(x) \leq  \alpha \beta^k tf'(x) \Delta x

      will be satisfied for large enough k .
      Can therefore consider the largest step size \beta^k t which guarantees a decrease comparable to the decrease predicted by linear extrapolation.








    Improved Idea:
    1. take step x \mapsto x + t \Delta x
    2. “backtrack”: find the first k\geq0 such that

      f(x+\beta^k t\Delta x) - f(x) \leq  \alpha \beta^k tf'(x) \Delta x

      is satisfied.
    3. Then (as we will see) we have “suitable decrease” and we proceed with choosing next descent direction.








    The aforementioned observations and ideas motivate a line search called backtracking line search.
    In higher dimension, the differential inequality

    f(x+t\Delta x) - f(x) \leq  \alpha tf'(x) \Delta x

    takes the form

    f(x+t\Delta x) - f(x) \leq  \alpha t\nabla f(x)^T \Delta x.

    This inequality is called the Armijo-Goldstein inequality or Armijo condition.
    The algorithm for backtracking line search may now be recorded:
    
    given 
    x \in \text{dom}\,f 
    descent direction \Delta x at x
    parameters \alpha \in (0,0.5), \beta \in (0,1)
    t = 1
    while: f(x+t\Delta x) - f(x) >  \alpha  t \nabla f(x)^T \Delta x 
    update: t:=\beta t
    








    Remarks.
    1. The loop
      
      while: f(x+t\Delta x) - f(x) >  \alpha  t \nabla f(x)^T \Delta x 
      update: t:=\beta t
      
      creates a sequence of step sizes

      1, \beta, \beta^2 ,  \beta^3 ,  \cdots,  \beta^k ,   \cdots

      and the sequence terminates once the exit criterion (Armijo condition) is satisfied.
      N.B.: it is possible for the search to terminate at t=1 .







    2. The while condition
      
      while: f(x+t\Delta x) - f(x) >  \alpha  t \nabla f(x)^T \Delta x 
      
      is understood as waiting until the objective is suitably decreased.
      Moreover, if the Armijo-Goldstein inequality holds for some t_0>0 , it also holds for all 0 < t \leq t_0 : the function h(t) := f(x+t\Delta x) - f(x) - \alpha t \nabla f(x)^T \Delta x is convex with h(0)=0 , so h(t) \leq \frac{t}{t_0}h(t_0) \leq 0 for 0 < t \leq t_0 .
      Such a t_0 always exists: by

      \begin{aligned} \lim_{t\to0^+} \frac{f(x+t\Delta x) - f(x)}{t} = \nabla f(x)^T \Delta x \leq \alpha \nabla f(x)^T \Delta x, \end{aligned}

      there is a small enough t_0>0 such that

      \begin{aligned} \frac{f(x+t\Delta x) - f(x)}{t} \leq \alpha \nabla f(x)^T \Delta x, \quad \text{ for } 0 < t \leq t_0. \end{aligned}

      This is the Armijo-Goldstein inequality rearranged.







    3. The assumption \alpha \in (0,0.5) and Armijo’s condition are sufficient for convergence of gradient descent coupled with backtracking line search, which is detailed below.
      The smaller \beta is, the faster \beta^k decreases and hence the quicker the Armijo-Goldstein inequality holds.
      However, this also results in smaller steps x \mapsto x + \beta^k t\Delta x.







    4. One needs to ensure f(x+t\Delta x) is well-defined to start the algorithm, i.e., that x + t \Delta x \in \text{dom}\, f.
      This can be done by taking t to be the first \beta^k t with x + \beta^k t\Delta x \in \text{dom}\, f.







    5. This algorithm always terminates for differentiable and convex f.
      The argument follows the one-dimensional argument.
      Indeed Taylor’s approximation ensures: for small t there holds

      f(x + t \Delta x) - f(x) \approx t \nabla f(x)^T \Delta x < \alpha t \nabla f(x)^T \Delta x.

      Thus, once k is large enough for t=\beta^k to be in the “small t ” range, the Armijo-Goldstein inequality holds.
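    The backtracking algorithm recorded above transcribes directly. A minimal sketch (assuming numpy; the function name is ours, and the caller must supply the gradient at x and a descent direction):

    import numpy as np

    def backtracking(f, grad_x, x, dx, alpha=0.25, beta=0.5):
        # Shrink t until the Armijo-Goldstein inequality holds.
        t = 1.0
        slope = grad_x @ dx    # = grad f(x)^T dx < 0 for a descent direction
        while f(x + t * dx) - f(x) > alpha * t * slope:
            t *= beta          # t := beta t
        return t

    Per remark 4 above, a domain check on x + t\Delta x would be added for objectives with \text{dom}\,f \neq \mathbb{R}^n .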







    Gradient Descent Recall: \Delta x is called a descent direction provided

    \nabla f(x)^T \Delta x < 0 .

    Therefore

    \Delta x = -\nabla f(x)

    provides a natural descent direction associated to the problem.
    This results in the following gradient descent method.
    
    given initial x \in \text{dom}\,f 
    repeat:
    1. Set \Delta x = - \nabla f(x) .
    2. Perform line search to determine step size t .
    3. Take step: x:= x + t\Delta x .
    until: stopping criterion holds.
    
    We focus on the cases where the line search is exact or backtracking.
    Recall: by strong convexity and the initial sublevel set S being closed, there exist constants 0 < m \leq M <\infty such that

    m Id \preceq \nabla^2 f(x) \preceq M Id \quad \text{ for } x \in S.
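    Putting the pieces together, a sketch of gradient descent with backtracking line search (assuming numpy; backtracking is as sketched earlier, and the quadratic test objective is our own):

    import numpy as np

    H = np.diag([1.0, 10.0])            # m = 1, M = 10
    f = lambda x: 0.5 * x @ H @ x       # p* = 0 at x* = 0
    grad = lambda x: H @ x

    def backtracking(x, dx, alpha=0.25, beta=0.5):
        t, slope = 1.0, grad(x) @ dx
        while f(x + t * dx) - f(x) > alpha * t * slope:
            t *= beta
        return t

    x = np.array([10.0, 1.0])
    while np.linalg.norm(grad(x)) > 1e-8:   # stopping criterion
        dx = -grad(x)                       # steepest-descent direction
        x = x + backtracking(x, dx) * dx
    print(x, f(x))                          # near x* = 0, p* = 0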









    Convergence for exact line search.
    We show
    • gradient descent with exact line search converges.
    • number of iterations needed to achieve tolerance f(x)-p^\star \leq \epsilon is bounded in terms of the problem data:
      • optimal value p^\star and initial value f(x^{(0)}),
      • desired tolerance \epsilon>0,
      • and the conditioning of \nabla^2 f(x) .








    Theorem. Suppose f is strongly convex with convexity constants m,M and its initial sublevel set S is closed. Then the gradient descent method with exact line search converges. Moreover, if the desired tolerance is \epsilon>0 , then

    f(x^{(k)})- p^\star \leq \epsilon

    holds after at most

    -\frac{\log\left( \epsilon^{-1}(f(x^{(0)}) - p^\star) \right)}{\log(1 - m/M)}

    many iterations.
    Proof. Step 0.
    The main idea is to establish constants c \in (0,1) and A>0 such that

    f(x^{(k)}) - p^\star \leq c^k A

    for all k \geq 0 .
    Indeed, if this holds,

    \lim_{k} c^k = 0 \implies \lim_k f(x^{(k)}) =p^\star .



    For notational simplicity: we forgo indexing by iteration step.
    For given iterate x , write t_{\text{exact}} for resulting exact line search step size.
    Write x^+ = x - t_{\text{exact}}\nabla f(x) for the next iterate after x using gradient descent with exact line search.








    Step 1.
    Recall: under strong convexity assumptions on f , there holds

    \begin{aligned} \nabla^2 f(x) &\preceq M Id\\ f(y) &\leq f(x) + \nabla f(x)^T(y-x) + \frac{M}{2}\Vert x-y\Vert_2^2. \end{aligned}

    Letting y = x - t \nabla f(x) :

    \begin{aligned} f(x - t\nabla f(x)) &\leq f(x)  - t \Vert \nabla f(x) \Vert_2^2 + \frac{Mt^2}{2}\Vert \nabla f(x) \Vert_2^2. \end{aligned}

    This holds for all t\geq0 with x - t \nabla f(x) \in S .







    Step 2.
    Using exact line search:

    \begin{aligned} t_{\text{exact}} &:= \text{argmin}\{ f(x- t\nabla f(x)): t \geq 0 \}\\ f(x^+)&=f(x-t_{\text{exact}}\nabla f(x)) \leq f(x - t\nabla f(x)) \end{aligned}

    for all t \geq 0 .







    Step 3.
    The convex quadratic

    \begin{aligned} f(x)  - t \Vert \nabla f(x) \Vert_2^2 + \frac{Mt^2}{2}\Vert \nabla f(x) \Vert_2^2 \end{aligned}

    is minimized at

    t = \frac{1}{M}

    and so

    \begin{aligned} f(x) - \frac{1}{2M}\Vert \nabla f(x) \Vert_2^2 \leq f(x)  - t \Vert \nabla f(x) \Vert_2^2 + \frac{Mt^2}{2}\Vert \nabla f(x) \Vert_2^2. \end{aligned}









    Step 4.
    Minimizing both sides of the Step 1. inequality

    \begin{aligned} f(x - t\nabla f(x)) &\leq f(x)  - t \Vert \nabla f(x) \Vert_2^2 + \frac{Mt^2}{2}\Vert \nabla f(x) \Vert_2^2 \end{aligned}

    and using Steps 2. and 3. gives

    f(x^+) = f(x - t_{\text{exact}}\nabla f(x)) \leq f(x) - \frac{1}{2M} \Vert \nabla f(x) \Vert_2^2 .

    Therefore

    f(x^+) - p^\star \leq f(x) - p^\star - \frac{1}{2M} \Vert \nabla f(x) \Vert_2^2 .









    Step 5.
    Recall we derived

    f(x) - p^\star \leq \frac{1}{2m}\Vert \nabla f(x) \Vert_2^2 .

    (This was when we derived a natural stopping criterion.)
    Using this and Step 4. gives

    \begin{aligned}  f(x^+) - p^\star & \leq f(x) - p^\star  - \frac{1}{2M} \Vert \nabla f(x) \Vert_2^2\\ &\leq f(x) - p^\star - \frac{1}{2M}2m(f(x)-p^\star)\\ &=c (f(x)-p^\star) \end{aligned}

    where

    \begin{aligned} c = 1- \frac{m}{M}<1. \end{aligned}









    Step 6.
    Let

    x^{++} = x^+ - t'_{\text{exact}}\nabla f(x^+)

    denote the iterate following x^+ using gradient descent and exact line search to find t'_{\text{exact}}.
    Applying the analysis with (x^{++},x^{+}) in place of (x^{+},x), we conclude

    \begin{aligned}  f(x^{++}) - p^\star &\leq c (f(x^{+})-p^\star) \\ &\leq c (c (f(x) - p^\star))\\ &= c^2 (f(x) - p^\star). \end{aligned}

    Letting x^{(k)} denote the kth iterate, iterating this argument gives

    \begin{aligned} f(x^{(k)}) - p^\star \leq c^k (f(x^{(0)}) -p^\star) \end{aligned} .









    Step 7.
    Since c = 1 - \frac{m}{M} < 1 , conclude c^k \to 0 as k \to \infty .
    Therefore

    \begin{aligned} \lim_{k \to \infty } f(x^{(k)}) - p^\star \leq  \lim_{k \to \infty } c^k (f(x^{(0)}) -p^\star) = 0. \end{aligned}

    It follows that f(x^{(k)}) \to p^\star and so there holds convergence.








    Step 8.
    To find k such that

    f(x^{(k)}) - p^\star \leq \epsilon ,

    we solve

    \begin{aligned} c^k (f(x^{(0)}) -p^\star) \leq \epsilon  \end{aligned}

    in terms of k :

    \frac{\log\left( \epsilon^{-1}(f(x^{(0)}) - p^\star) \right)}{\log(1/c)} \leq k.

    Therefore, as soon as k surpasses this number, the tolerance is satisfied.
    Recall c = 1 - \frac{m}{M}.


    Remarks.
    1. Recall: above we showed that

      \begin{aligned} \frac{1}{2M} \Vert \nabla f(x) \Vert_2^2 &\leq f(x) - p^\star \leq \frac{1}{2m}\Vert \nabla f(x) \Vert_2^2\\ \Vert x^\star - x \Vert_2 &\leq \frac{2}{m} \Vert \nabla f(x) \Vert_2. \end{aligned}

      It follows that f(x^{(k)}) - p^\star \to 0 implies x^{(k)} \to x^\star .
    2. We can use these a priori estimates to observe

      -\frac{\log\left( \epsilon^{-1}(f(x^{(0)}) - p^\star) \right)}{\log(1 - m/M)} \leq -\frac{\log\left( \epsilon^{-1}m^{-1}2^{-1}\Vert \nabla f(x^{(0)}) \Vert_2^2 \right) }{\log(1 - m/M)} .

    3. While the RHS is independent of knowing p^\star , it is conceivably a worse upper bound on the maximum number of steps required to achieve the desired tolerance.
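    The bound can also be compared against practice. A sketch (assuming numpy; the instance is ours) on f(x) = \frac{1}{2}x^T\text{diag}(m,M)x , for which exact line search has the closed form t = \Vert g\Vert_2^2 / (g^THg) with g = \nabla f(x) :

    import numpy as np

    m, M, eps = 1.0, 20.0, 1e-8
    H = np.diag([m, M])
    x = np.array([1.0, 1.0])
    f0 = 0.5 * x @ H @ x                # f(x^(0)); here p* = 0 exactly

    k = 0
    while 0.5 * x @ H @ x > eps:
        g = H @ x
        t = (g @ g) / (g @ H @ g)       # exact step size for a quadratic
        x = x - t * g
        k += 1

    bound = -np.log(f0 / eps) / np.log(1 - m / M)
    print(k, bound)   # the empirical k does not exceed the theoretical bound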








    Convergence for backtracking line search.

    We show
    • gradient descent with backtracking line search converges.
    • number of iterations needed to achieve tolerance f(x)-p^\star \leq \epsilon is bounded in terms of the problem data:
      • optimal value p^\star and initial value f(x^{(0)}),
      • desired tolerance \epsilon>0,
      • and the conditioning of \nabla^2 f(x) .








    Theorem. Suppose f is strongly convex with convexity constants m,M and its initial sublevel set S is closed. Then the gradient descent method with backtracking line search converges. Moreover, the Armijo-Goldstein inequality holds for 0 < t < \frac{1}{M} . Lastly, if the desired tolerance is \epsilon>0 , then

    f(x^{(k)})- p^\star \leq \epsilon

    holds after at most

    -\frac{\log\left( \epsilon^{-1}(f(x^{(0)}) - p^\star) \right)}{\log(1 - \text{min}\{ 2m\alpha,2\alpha\beta m/M\})}

    many iterations.
    Proof. Step 0.
    The main idea is to establish constants c \in (0,1) and A>0 such that

    f(x^{(k)}) - p^\star \leq c^k A

    for all k \geq 0 .
    Indeed, if this holds,

    \lim_{k} c^k = 0 \implies \lim_k f(x^{(k)}) =p^\star .









    Step 1.
    Let 0<t<\frac{1}{M} be fixed but arbitrary and let x^+ = x + t \Delta x.
    In the next step, we will show the Armijo-Goldstein inequality holds whenever 0 < t < \frac{1}{M}.
    Beforehand: since gradient descent chooses the descent direction

    \Delta x = - \nabla f(x) ,

    we have

    \nabla f(x)^T \Delta x = - \nabla f(x)^T \nabla f(x) = -\Vert \nabla f(x) \Vert_2^2

    and so the Armijo-Goldstein inequality takes the form

    f(x^+) - f(x) \leq -\alpha t \Vert \nabla f(x) \Vert_2^2.









    Step 2.
    Recall strong convexity and S closed gave

    f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{M}{2}\Vert y-x \Vert_2^2.

    Taking y = x^+ = x - t\nabla f(x) gives

    f(x^+) \leq f(x) -t \Vert \nabla f(x)\Vert_2^2 + \frac{Mt^2}{2}\Vert \nabla f(x) \Vert_2^2.

    Noting

    \begin{aligned} 0 < t < \frac{1}{M} &\implies  Mt^2 \leq t\\ & \implies -t + \frac{Mt^2}{2} \leq - \frac{t}{2}, \end{aligned}

    and using \alpha \in (0,0.5) , we conclude

    \begin{aligned} f(x^+) &\leq f(x) - \frac{t}{2}\Vert \nabla f(x) \Vert_2^2\\ &\leq f(x) - \alpha t \Vert \nabla f(x) \Vert_2^2. \end{aligned}

    In conclusion: if 0<t<\frac{1}{M}, then the Armijo-Goldstein inequality

    f(x^+) - f(x) \leq - \alpha t \Vert \nabla f(x) \Vert_2^2

    holds.







    Step 3.
    By preceding step: the backtracking line search terminates once 0 < t < \frac{1}{M} or at the initial step with t = 1 .
    Supposing
    • \frac{1}{M} < 1
    • line search does not terminate at t = 1,
    then line search terminates for some t \geq \frac{\beta}{M}.
    Indeed, let k be the largest integer such that the line search does not terminate at t= \beta^k .
    Then \beta^k \geq \frac{1}{M} and termination happens at t'=\beta^{k+1}, which consequently satisfies t' \geq \frac{\beta}{M}.
    The claim follows.







    Step 4.
    If backtracking line search terminates at t=1, then

    f(x^+) \leq f(x) - \alpha \Vert\nabla f(x) \Vert_2^2 .

    If backtracking line search terminates for some t' \geq \frac{\beta}{M}, then

    \begin{aligned}  f(x^+) &\leq f(x) - \alpha t' \Vert\nabla f(x) \Vert_2^2 \\ &\leq f(x) - \frac{\alpha \beta}{M} \Vert \nabla f(x) \Vert_2^2. \end{aligned}

    Therefore, if either t=1 or t=t', then

    f(x^+) \leq f(x) - \text{min}\left\{ \alpha, \frac{\alpha\beta}{M} \right\} \Vert\nabla f(x) \Vert_2^2 .









    Step 5.
    Recall

    2m(f(x) - p^\star) \leq \Vert \nabla f(x) \Vert_2^2 .

    The preceding step thus gives

    \begin{aligned} f(x^+) - p^\star &\leq f(x) - p^\star - \text{min}\left\{ \alpha, \frac{\alpha\beta}{M} \right\} \Vert\nabla f(x) \Vert_2^2\\ & \leq f(x) - p^\star - \text{min}\left\{ \alpha, \frac{\alpha\beta}{M} \right\} \cdot 2m(f(x)-p^\star)\\ &\leq \left( 1 - \text{min}\left\{ 2m\alpha, \frac{2m\alpha\beta}{M} \right\} \right) (f(x) - p^\star)\\ &=: c(f(x)-p^\star), \end{aligned}

    where

    c= 1 - \text{min}\left\{ 2m\alpha, \frac{2m\alpha\beta}{M} \right\} < 1.

    Iterating the argument gives

    f(x^{(k)}) - p^\star \leq c^k (f(x^{(0)}) - p^\star)

    and whence the desired convergence since c^k \to 0 .







    Step 6.
    To find k such that

    f(x^{(k)}) - p^\star \leq \epsilon ,

    we solve

    \begin{aligned} c^k (f(x^{(0)}) -p^\star) \leq \epsilon  \end{aligned}

    in terms of k :

    \frac{\log\left( \epsilon^{-1}(f(x^{(0)}) - p^\star) \right)}{\log(1/c)} \leq k.

    Therefore, as soon as k surpasses this number, the tolerance is satisfied.
    Recall

    c=1 - \text{min}\left\{ 2m\alpha, \frac{2m\alpha\beta}{M} \right\} .











    Remark.
    The remarks after the convergence theorem for exact line search apply to this convergence theorem for backtracking line search.














    On the Condition Number Objectives
    1. Define condition numbers for matrices and convex subsets.
    2. Establish a connection between strong convexity of a function and the condition number of its (convex) sublevel sets.
    3. Prepare for understanding how conditioning of the Hessian is important for convergence in gradient descent algorithms.








    Condition Number of a Matrix
    Given a matrix A \in \boldsymbol{S}_{++}^n , let

    \begin{aligned} \lambda_{\text{min}}(A) &= \text{ minimum eigenvalue of } A\\ \lambda_{\text{max}}(A) &= \text{ maximum eigenvalue of } A. \end{aligned}

    Condition number: the ratio

    \kappa(A) := \frac{\lambda_{\text{max}}(A)}{\lambda_{\text{min}}(A)}.









    Remarks.
    1. There holds \kappa(A^{-1}) = \kappa(A).
      Indeed:

      \begin{aligned} \lambda_{\text{min}}(A^{-1}) &= \frac{1}{\lambda_{\text{max}}(A)}\\ \lambda_{\text{max}}(A^{-1}) &= \frac{1}{\lambda_{\text{min}}(A)} \end{aligned}

      and so

      \begin{aligned} \kappa(A^{-1})=\frac{\lambda_{\text{max}}(A^{-1})}{\lambda_{\text{min}}(A^{-1})} = \frac{1/\lambda_{\text{min}}(A)}{1/\lambda_{\text{max}}(A)} = \frac{\lambda_{\text{max}}(A)}{\lambda_{\text{min}}(A)} = \kappa(A). \end{aligned}









    2. For c>0 , there holds \kappa(cA) =  \kappa(A).
      Indeed:

      \begin{aligned} \lambda_{\text{min}}(cA) &= c \lambda_{\text{min}}(A)\\ \lambda_{\text{max}}(cA) &= c \lambda_{\text{max}}(A) \end{aligned}

      and so

      \kappa(cA) = \frac{\lambda_{\text{max}}(cA)}{\lambda_{\text{min}}(cA)} = \frac{c\lambda_{\text{max}}(A)}{c\lambda_{\text{min}}(A)} = \frac{\lambda_{\text{max}}(A)}{\lambda_{\text{min}}(A)} = \kappa(A).
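    Both remarks are immediate to confirm numerically; a minimal sketch (assuming numpy; the helper kappa is ours):

    import numpy as np

    def kappa(A):
        # Ratio of extremal eigenvalues; eigvalsh returns them in ascending order.
        w = np.linalg.eigvalsh(A)
        return w[-1] / w[0]

    A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 1 and 3
    print(kappa(A), kappa(np.linalg.inv(A)), kappa(5 * A))   # all equal 3.0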









    Condition Number of a Convex Subset
    Let K \subset \mathbb{R}^n be a convex subset.
    Directional Width: given a direction

    \begin{aligned} \nu & \in \mathbb{R}^n\\ \Vert \nu \Vert_2 &= 1, \end{aligned}

    the number

    \begin{aligned} W(K,\nu) := \sup\{\nu^T x : x \in K\} - \inf\{ \nu^Tx : x \in K \}. \end{aligned}









    Remarks.
    1. W(K,\cdot) provides a means of measuring relative eccentricity between any two directions; e.g.,

      W(K,\nu_1)>W(K,\nu_2)

      implies K is elongated more in the direction \nu_1 than \nu_2.
    2. The numerical value W(K,\nu) is independent of the placement of K.
      Justification. Given x_0 \in \mathbb{R}^n , let

      K + x_0 = \{x + x_0 : x \in K \} .

      Then

      \begin{aligned} W(K+x_0,\nu)&=\sup\{\nu^Tx : x \in K + x_0 \} - \inf\{\nu^Tx : x \in K + x_0 \} \\ &= \sup\{\nu^T(x+x_0) : x \in K  \} - \inf\{\nu^T(x+x_0) : x \in K \}\\ &= \sup\{\nu^Tx : x \in K \} + \nu^T x_0 - \inf\{\nu^Tx:x \in K\} - \nu^T x_0\\ &= \sup\{\nu^Tx : x \in K \} - \inf\{\nu^Tx:x \in K\}\\ &= W(K,\nu). \end{aligned}









    Examples:
    1. Let

      B(0,r) = \{ x \in \mathbb{R}^n : \Vert x \Vert_2 < r \}

      be the open ball of radius r>0 and center 0 \in \mathbb{R}^n.
      Then

      \begin{aligned} W(B(0,r),\nu) &= \sup\{\nu^Tx : \Vert x \Vert_2 < r \} - \inf\{\nu^Tx: \Vert x \Vert_2 < r\}\\ &= \nu^T (r\nu) - \nu^T(-r\nu)\\ &= 2r. \end{aligned}

      As expected: directional widths of Euclidean balls are independent of the direction \nu.








    2. Let K_1,K_2 \subset \mathbb{R}^n be convex subsets with K_1 \subset K_2 .
      Using

      \begin{aligned} \sup\{\nu^T x: x \in K_1 \} \leq \sup\{\nu^T x : x \in K_2\}\\ \inf\{\nu^T x: x \in K_1 \} \geq \inf\{\nu^T x : x \in K_2\}\\ \end{aligned}

      we have

      \begin{aligned} W(K_1,\nu) &= \sup\{\nu^T x: x \in K_1 \}  - \inf\{\nu^T x: x \in K_1 \} \\ &\leq \sup\{\nu^T x : x \in K_2\} - \inf\{\nu^T x : x \in K_2\}\\ &=W(K_2,\nu). \end{aligned}

      As expected: larger sets have larger directional widths.








    Extremal Widths: the maximum and minimum widths are defined as

    \begin{aligned} \text{maximum width} &= W_{\text{max}}(K) = \sup\{W(K,\nu): \Vert \nu\Vert_2 = 1 \}\\ \text{minimum width} &= W_{\text{min}}(K) = \inf\{W(K,\nu): \Vert \nu\Vert_2 = 1 \}. \end{aligned}

    Condition number: the ratio

    \kappa(K) = \left(\frac{W_{\text{max}}(K)}{W_{\text{min}}(K)} \right)^2.

    Remark.
    \kappa(K) >1 measures a lack of symmetry and indicates K is thin (or elongated) in some preferred direction; however, \kappa(K)=1 does not indicate K is a ball.







    Examples.
    1. Let

      B(0,r) = \{ x \in \mathbb{R}^n : \Vert x \Vert_2 < r \}

      be the open ball of radius r>0 and center 0 \in \mathbb{R}^n.
      As shown above,

      W(B(0,r),\nu) = 2r

      is constant relative to \nu.
      Thus

      \begin{aligned} W_{\text{max}}(B(0,r)) = \sup\{W(B(0,r),\nu): \Vert \nu\Vert_2 = 1 \} = 2r\\ W_{\text{min}}(B(0,r)) = \inf\{W(B(0,r),\nu): \Vert \nu\Vert_2 = 1 \} = 2r\\ \end{aligned}

      and so

      \kappa(B(0,r)) = \left( \frac{W_{\text{max}}(B(0,r))}{W_{\text{min}}(B(0,r))} \right)^2 = \left( \frac{2r}{2r} \right)^2=1.

      N.B.: Euclidean balls are not the only convex sets satisfying \kappa = 1; e.g., Reuleaux triangles are of constant width.








    2. Let K_1,K_2,\Omega\subset \mathbb{R}^n be convex subsets with K_1 \subset \Omega \subset K_2 .
      From a previous example above, we have

      W(K_1,\nu) \leq W(\Omega,\nu) \leq W(K_2,\nu).

      Thus

      \begin{aligned} W_{\text{min}}(K_1) &= \inf W(K_1,\nu) \leq \inf W(\Omega,\nu) = W_{\text{min}}(\Omega)\\ W_{\text{max}}(\Omega) &= \sup W(\Omega,\nu) \leq \sup W(K_2,\nu) = W_{\text{max}}(K_2) \end{aligned}

      and so

      \begin{aligned} \kappa(\Omega) = \left( \frac{W_{\text{max}}(\Omega)}{W_{\text{min}}(\Omega)} \right)^2  \leq \left( \frac{W_{\text{max}}(K_2)}{W_{\text{min}}(K_1)} \right)^2  \end{aligned}.

      For example, if K_1 = B(0,r_1) and K_2 = B(0,r_2), then

      \begin{aligned} \kappa(\Omega)  \leq \left( \frac{W_{\text{max}}(K_2)}{W_{\text{min}}(K_1)} \right)^2 = \left( \frac{2r_2}{2r_1} \right)^2 = \frac{r_2^2}{r_1^2} \end{aligned}.









    3. Given Q \in \boldsymbol{S}_{++}^n and x_0 \in \mathbb{R}^n, define the set

      \mathcal{E} = \{ x \in \mathbb{R}^n : (x-x_0)^TQ^{-1}(x-x_0) \leq 1 \}.

      N.B.: \mathcal{E} is an ellipsoid and all ellipsoids can be described like this.
      We set out to determine the condition number of \mathcal{E}.
      In fact, we prove the following.

      Proposition. Let \mathcal{E} and Q be as above. Then their condition numbers are the same:

      \kappa(\mathcal{E}) = \kappa(Q) = \kappa(Q^{-1}).

      Proof. Step 0.
      As observed above, we take WLOG x_0=0.
      We need to compute the directional widths W(\mathcal{E},\nu) and therefore need to first compute

      \begin{aligned} \sup\{ \nu^T x: x \in \mathcal{E}\} \quad \text{ and } \quad \inf\{ \nu^T x : x \in \mathcal{E} \}. \end{aligned}

      This can be achieved by solving the optimization problems

      \begin{cases} \text{minimize} & \pm \nu^Tx\\ \text{subject to}& x^TQ^{-1}x \leq 1 \end{cases}.

      We employ Lagrangian duality.
      N.B.: these problems are convex and satisfy Slater’s condition.







      Step 1.
      The Lagrangian is

      L(x,\lambda) = \pm \nu^Tx + \lambda (x^T Q^{-1} x  - 1).

      Its gradient is

      \nabla_x L(x,\lambda) = \pm \nu + 2\lambda Q^{-1} x.

      The KKT conditions demand

      \begin{aligned} \lambda( x^T Q^{-1} x - 1) &= 0\\ \nabla_x L &= 0 \end{aligned}

      Observe that \lambda = 0 would imply \nu=0, which is a contradiction.
      Thus \lambda>0 and so complementary slackness implies

      x^T Q^{-1} x  = 1

      and \nabla L=0 gives

      x = \mp \frac{1}{2\lambda} Q\nu.









      Step 2.
      Using

      \begin{aligned} 1&=x^T Q^{-1} x  \\ x &= \mp \frac{1}{2\lambda} Q\nu \end{aligned}

      we have

      \begin{aligned} 1 &= \left(\mp\frac{1}{2\lambda}Q\nu \right)^T Q^{-1} \left(\mp\frac{1}{2\lambda}Q\nu \right)\\ &=\frac{1}{4\lambda^2}\nu^T Q \nu. \end{aligned}

      Solving for \lambda gives

      \lambda =\frac{1}{2} \sqrt{\nu^T Q \nu}= \frac{1}{2} \sqrt{\nu^T Q^{1/2} Q^{1/2}\nu} = \frac{1}{2} \Vert Q^{1/2}\nu \Vert_2.









      Step 3.
      Using our evaluation of \lambda gives

      \begin{aligned} x^\star &= \mp \frac{1}{2\lambda}Q\nu\\ &= \mp \frac{1}{\Vert Q^{1/2}\nu\Vert_2} Q\nu\\ p^\star &= \pm \nu^T x^\star\\ &= - \frac{1}{\Vert Q^{1/2}\nu\Vert_2} \nu^TQ\nu\\ &= - \Vert Q^{1/2}\nu\Vert_2. \end{aligned}

      This gives

      \begin{aligned} \sup\{ \nu^Tx : x \in \mathcal{E}\} &= - \inf\{ - \nu^T x : x \in \mathcal{E} \}\\ &=\Vert Q^{1/2}\nu\Vert_2\\ \inf\{ \nu^Tx : x \in \mathcal{E}\} &= -\Vert Q^{1/2}\nu\Vert_2. \end{aligned}









      Step 4.
      We can now compute the directional and extremal widths.
      From Step 3. we have

      \begin{aligned} W(\mathcal{E},\nu) & = \sup\{ \nu^Tx : x \in \mathcal{E}\} - \inf\{ \nu^Tx : x \in \mathcal{E}\}\\ &=\Vert Q^{1/2}\nu\Vert_2 - (-\Vert Q^{1/2}\nu\Vert_2)\\ &= 2 \Vert Q^{1/2}\nu\Vert_2, \end{aligned}

      and so

      \begin{aligned} W_{\text{min}}(\mathcal{E}) &= \inf\{2 \Vert Q^{1/2}\nu\Vert_2: \Vert \nu\Vert_2 = 1\}\\ &= 2 \sqrt{\lambda_{\text{min}}(Q)}\\ W_{\text{max}}(\mathcal{E}) &= \sup\{2 \Vert Q^{1/2}\nu\Vert_2: \Vert \nu\Vert_2 = 1\}\\ &= 2 \sqrt{\lambda_{\text{max}}(Q)}. \end{aligned}

      At last, we compute the condition number of \mathcal{E}:

      \kappa(\mathcal{E}) = \left(\frac{W_{\text{max}}(\mathcal{E})}{W_{\text{min}}(\mathcal{E})} \right)^2 = \frac{\lambda_{\text{max}}(Q)}{\lambda_{\text{min}}(Q)} = \kappa(Q)

      which is what we wanted to show.
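    The proposition can also be checked by sampling directions. A sketch (assuming numpy; the random Q and the sampling scheme are ours), using the identity W(\mathcal{E},\nu) = 2\Vert Q^{1/2}\nu \Vert_2 from Step 4:

    import numpy as np

    rng = np.random.default_rng(3)
    B = rng.standard_normal((3, 3))
    Q = B @ B.T + np.eye(3)                  # a generic SPD matrix
    w, V = np.linalg.eigh(Q)
    Q_half = V @ np.diag(np.sqrt(w)) @ V.T   # symmetric square root of Q

    nus = rng.standard_normal((100_000, 3))
    nus /= np.linalg.norm(nus, axis=1, keepdims=True)    # unit directions
    widths = 2 * np.linalg.norm(nus @ Q_half, axis=1)    # W(E, nu)

    print(widths.max(), 2 * np.sqrt(w[-1]))              # approximately W_max
    print(widths.min(), 2 * np.sqrt(w[0]))               # approximately W_min
    print((widths.max() / widths.min()) ** 2, w[-1] / w[0])   # kappa(E) ~ kappa(Q)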








    Strong Convexity and Conditioning
    Let f:\mathbb{R}^n \to \mathbb{R} be strongly convex and have closed sublevel set S = \{ x : f(x) \leq f(x^{(0)})\} for some x^{(0)} \in \text{dom}\,f.
    Recall: it follows that

    mId \preceq \nabla^2f(x) \preceq M Id

    for some 0<m \leq M < \infty .

    For p^\star < \alpha \leq f(x^{(0)}) , let

    S_\alpha = \{ x : f(x) \leq \alpha \}

    be the sublevel set corresponding to \alpha.
    We set out to prove that the condition number \kappa(S_\alpha) is controlled by the conditioning of \nabla^2 f(x) .








    Remark.
    The condition number of \nabla^2 f(x) depends on x .
    However,

    mId \preceq \nabla^2f(x) \preceq M Id

    gives the upper estimate of

    \kappa(\nabla^2 f(x)) \leq \frac{M}{m}

    for all x \in S.
    Important points:
    1. Since M can be taken arbitrarily large and m arbitrarily small, this estimate can be very bad.
    2. M depends on S and hence the initial choice x^{(0)}.
      The further x^{(0)} is chosen from x^\star, the worse (larger) M can be.








    Little example.
    Let f(x_1,x_2) = e^{\frac{1}{2}(x_1^2+x_2^2)} .
    One computes

    \begin{aligned} e^{\frac{1}{2}(x_1^2+x_2^2)} Id& \preceq \nabla^2 f(x) \\ &=  e^{\frac{1}{2}(x_1^2+x_2^2)} \begin{bmatrix} 1+x_1^2 & x_1x_2\\ x_1x_2 & 1+x_2^2 \end{bmatrix}\\ & \preceq e^{\frac{1}{2}(x_1^2+x_2^2)}(1+x_1^2+x_2^2) Id \end{aligned}

    with the inequalities given by the extremal eigenvalues.
    Evidently \kappa(\nabla^2 f(x)) depends on x.
    If x^{(0)}=(1,1) , then S = \{ x_1^2+x_2^2 \leq 2 \} and so we may take

    \begin{aligned}  Id& \preceq \nabla^2 f(x)  \preceq 3e Id \end{aligned}

    with m=1,M=3e .
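    A quick numeric confirmation of these constants (a sketch assuming numpy; we sample S through a uniform grid of the enclosing square):

    import numpy as np

    # Hessian of f(x) = exp(||x||^2 / 2) is exp(||x||^2 / 2) (Id + x x^T).
    def hess(x):
        return np.exp(x @ x / 2) * (np.eye(2) + np.outer(x, x))

    rng = np.random.default_rng(4)
    pts = rng.uniform(-np.sqrt(2), np.sqrt(2), size=(5000, 2))
    pts = pts[(pts ** 2).sum(axis=1) <= 2]           # keep points of S
    eigs = np.array([np.linalg.eigvalsh(hess(x)) for x in pts])
    print(eigs[:, 0].min(), eigs[:, 1].max())        # near m = 1 and M = 3e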







    Proposition. Let f, m , M, \alpha, S_\alpha be as above. Then

    \kappa(S_\alpha) \leq \frac{M}{m} .

    Remark.
    Thus, the better conditioned \nabla^2 f(x) is in the sense

    M/m \approx 1 ,

    the better conditioned its sublevel sets are.
    Conversely, if the sublevel sets of f are poorly conditioned in the sense

    \kappa(S_\alpha) \gg 1 ,

    then \nabla^2 f(x) will be poorly conditioned, i.e.,

    M \gg m .

    Proof. Step 1.
    Using

    mId \preceq \nabla^2f(x) \preceq M Id

    and

    \begin{aligned} f(y) = f(x) + \nabla f(x)^T(y-x) + \frac{1}{2}(y-x)^T \nabla^2 f(z)(y-x)  \end{aligned}

    for suitable z , we have (taking x=x^\star and using \nabla f(x^\star)=0 ):

    \frac{m}{2}\Vert y-x^\star \Vert_2^2 + p^\star \leq f(y) \leq \frac{M}{2}\Vert y-x^\star \Vert_2^2 + p^\star

    for y \in S.






    Let

    \begin{aligned} B_i &= \left\{ y \in S: \Vert y-x^\star \Vert_2 \leq \sqrt{\frac{2(\alpha - p^\star)}{M}} \right\}\\ B_o &= \left\{ y \in S: \Vert y-x^\star \Vert_2 \leq \sqrt{\frac{2(\alpha - p^\star)}{m}} \right\} \end{aligned}

    We show

    B_i \subset S_\alpha \subset B_o .

    By observations above, the conditioning of S_\alpha is estimated in terms of the extremal widths of B_i and B_o .







    Step 2.
    To begin, let y \in B_i .
    Using

    \Vert y - x^\star \Vert_2 \leq \sqrt{\frac{2(\alpha-p^\star)}{M}}

    we have

    \begin{aligned} f(y) &\leq \frac{M}{2}\Vert y-x^\star \Vert_2^2 + p^\star\\ &\leq \frac{M}{2}\frac{2(\alpha-p^\star)}{M} + p^\star\\ &=\alpha \end{aligned}

    and so y \in S_\alpha.








    Step 3.
    For the other containment, let y \in S_\alpha .
    Using

    f(y) \leq \alpha

    we have

    \begin{aligned} \frac{m}{2}\Vert y - x^\star \Vert_2^2 + p^\star \leq \alpha \end{aligned}

    whence

    \begin{aligned} \Vert y - x^\star \Vert_2 \leq \sqrt{\frac{2(\alpha-p^\star)}{m}}, \end{aligned}

    thereby establishing y \in B_o.







    Step 4.
    By observations made above, we use B_i \subset S_\alpha \subset B_o to conclude

    \begin{aligned} \kappa(S_\alpha) &= \left(\frac{W_{\text{max}}(S_\alpha)}{W_{\text{min}}(S_\alpha)} \right)^2\\ &\leq \left(\frac{W_{\text{max}}(B_o)}{W_{\text{min}}(B_i)} \right)^2\\ &\leq \left(\frac{\sqrt{\frac{2(\alpha-p^\star)}{m}}}{\sqrt{\frac{2(\alpha-p^\star)}{M}}} \right)^2\\ &=\frac{M}{m}. \end{aligned}









    Proposition. Let f, m , M, \alpha, S_\alpha be as above. Then

    \begin{aligned} \lim_{\alpha \to p^\star } \kappa(S_\alpha) = \kappa\left(\nabla^2 f(x^\star) \right). \end{aligned}

    Remark.
    The point is that, as the sublevel sets shrink to x^\star , the problem’s conditioning is dictated by the conditioning of \nabla^2 f(x^\star).
    Proof (sketch). Step 1.
    Using Taylor approximation at x^\star , there holds

    \begin{aligned} f(y) &\approx  f(x^\star) + \nabla f(x^\star)^T (y-x^\star) + \frac{1}{2}(y-x^\star)^T \nabla^2 f(x^\star)(y-x^\star)\\ &= p^\star + \frac{1}{2}(y-x^\star)^T \nabla^2 f(x^\star)(y-x^\star) \end{aligned}

    for y near x^\star .







    Step 2.
    Observing that

    \bigcap_{\alpha>p^\star} S_\alpha = S_{p^\star} = \{ x^\star \},

    we can choose \alpha near p^\star so that y \in S_\alpha is near x^\star .
    Then the above Taylor approximation concludes y \in S_\alpha iff

    \begin{aligned} \alpha \gtrsim p^\star + \frac{1}{2}(y-x^\star)^T \nabla^2 f(x^\star)(y-x^\star) \end{aligned}

    which holds iff

    \begin{aligned} 2(\alpha-p^\star) \gtrsim (y-x^\star)^T \nabla^2 f(x^\star)(y-x^\star). \end{aligned}









    Step 3.
    Previous step indicates: if y \in S_\alpha then y belongs to or nearly belongs to

    \begin{aligned} \{ y : (y-x^\star)^T \nabla^2 f(x^\star)(y-x^\star) \leq 2(\alpha-p^\star) \}, \end{aligned}

    which is an ellipsoid defined by the matrix

    \left(\nabla^2 f(x^\star) \right)^{-1}

    and so

    \kappa(S_\alpha) \approx \kappa(\nabla^2 f(x^\star)) .

    (Recall \kappa(A) = \kappa(A^{-1}).)
    Taking \alpha closer to p^\star evidently improves these approximations and so we conclude

    \begin{aligned} \lim_{\alpha \to p^\star } \kappa(S_\alpha) = \kappa\left(\nabla^2 f(x^\star) \right). \end{aligned}
















    Example (See also CO Example 9.3.2)
    Let

    \begin{aligned}  f(x_1,x_2) = e^{\frac{1}{2}(x_1^2+x_2^2)}. \end{aligned}

    Goal
    Argue how changing the conditioning of the sublevel sets of f affects convergence.






    Sublevel set
    Observe

    \begin{aligned} S_\alpha &= \{e^{\frac{1}{2}(x_1^2+x_2^2)} \leq \alpha \}\\ &= \{ x_1^2 + x_2^2 \leq 2\log \alpha \} \end{aligned}

    are disks centered at the origin and with radius \sqrt{2\log\alpha}.
    As such, the sublevel sets are well-conditioned:

    \kappa(S_\alpha)=1.







    Conditioning of the Hessian
    We compute

    \begin{aligned} \nabla f(x_1,x_2)  &=  e^{\frac{1}{2}(x_1^2+x_2^2)} \begin{bmatrix} x_1\\x_2 \end{bmatrix}\\ \nabla^2 f(x_1,x_2) &= e^{\frac{1}{2}(x_1^2+x_2^2)} \begin{bmatrix} 1+x_1^2 & x_1x_2\\ x_1x_2 & 1+x_2^2 \end{bmatrix} \end{aligned}.

    The extremal eigenvalues of \nabla^2 f(x_1,x_2) are

    \begin{aligned} \lambda_{\text{min}}(\nabla^2 f(x_1,x_2)) &= e^{\frac{1}{2}(x_1^2+x_2^2)}\\ \lambda_{\text{max}}(\nabla^2 f(x_1,x_2)) &= e^{\frac{1}{2}(x_1^2+x_2^2)}(1+x_1^2+x_2^2). \end{aligned}

    Therefore

    \kappa(\nabla^2 f(x_1,x_2)) = \frac{\lambda_{\text{max}}(\nabla^2 f(x_1,x_2))}{\lambda_{\text{min}}(\nabla^2 f(x_1,x_2))} = 1 + x_1^2+x_2^2

    and so \nabla^2 f is reasonably well-conditioned near x^\star = (0,0).






    Let x^{(0)} = (1,1) be the initial point and define the initial sublevel set

    S = \{ f(x) \leq f(x^{(0)}) = e\}.

    On S, we have

    \begin{aligned} m=1 &\leq e^{\frac{1}{2}(x_1^2+x_2^2)} = \lambda_{\text{min}}(\nabla^2 f(x_1,x_2)) \\ \lambda_{\text{max}}(\nabla^2 f(x_1,x_2)) &= e^{\frac{1}{2}(x_1^2+x_2^2)}(1+x_1^2+x_2^2)  \leq 3e =M. \end{aligned}

    Thus f satisfies the strong convexity bounds

    \begin{aligned} Id \preceq \nabla^2 f(x_1,x_2) \preceq 3e Id, \quad (x_1,x_2) \in S. \end{aligned}







    Applying Gradient Descent
    Using gradient descent with exact line search solves the problem in one step.
    This is due to the radial symmetry of the sublevel sets: -\nabla f(x) points from x directly toward the origin, where f is minimized.
    Using backtracking line search, the problem is likewise solved without issue.
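    A minimal backtracking gradient descent sketch confirms this from x^{(0)}=(1,1) (the parameters \alpha=0.25, \beta=0.5 are our own illustrative choices, not prescribed by the notes):

    import numpy as np

    f = lambda x: np.exp(0.5 * (x @ x))
    grad = lambda x: f(x) * x

    x = np.array([1.0, 1.0])
    for k in range(500):
        g = grad(x)
        if np.linalg.norm(g) < 1e-8:
            break
        t = 1.0
        # backtrack until Armijo's condition f(x - t g) <= f(x) - 0.25 t ||g||^2 holds
        while f(x - t * g) > f(x) - 0.25 * t * (g @ g):
            t *= 0.5
        x = x - t * g
    print(x, k)  # converges to the origin in a handful of iterations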







    Unconditioning the Problem
    Consider the anisotropic dilation: (x_1,x_2) \mapsto (\gamma x_1,x_2) for an arbitrary \gamma > 1 .

    Applying this dilation to f results in the function

    f_\gamma (x_1,x_2) = f(\gamma x_1,x_2) = e^{\frac{1}{2}(\gamma^2 x_1^2 + x_2^2)}.

    Observe: the sublevel sets of f_\gamma are of the form

    \begin{aligned} S_\alpha &= \{ e^{\frac{1}{2}(\gamma^2 x_1^2 + x_2^2)} \leq \alpha \}\\ &=\{ \gamma^2 x_1^2 + x_2^2 \leq  2\log \alpha\}, \end{aligned}

    which are ellipses with extremal widths in the ratio \gamma , so that \kappa(S_\alpha) = \gamma^2.
    Evidently, the larger \gamma is, the worse conditioned S_\alpha is.
    N.B.: by the anisotropy, f_\gamma is more sensitive to changes in x_1 than in x_2.







    Now compute

    \nabla f_\gamma(x) = e^{\frac{1}{2}(\gamma^2 x_1^2 + x_2^2)} \begin{bmatrix} \gamma^2 x_1\\ x_2 \end{bmatrix}.

    The effects of poor conditioning can already be observed: for large \gamma>1, considering the step

    x^+ = x - t e^{\frac{1}{2}(\gamma^2 x_1^2 + x_2^2)} \begin{bmatrix} \gamma^2 x_1\\ x_2 \end{bmatrix},

    we see x^+ is obtained from x by stepping significantly further in the x_1-coordinate than the x_2-coordinate.
    This is despite the fact that the minimizer is still at the origin.







    Applying Gradient Descent
    While exact line search will still find the minimizer for this problem quickly (2 steps), one finds backtracking line search becomes impractical for large \gamma .
    Recalling that the Armijo–Goldstein inequality is satisfied for t<1/M, this is unsurprising: the larger M is, the smaller t likely needs to be for the inequality to hold.
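    To see the slowdown, one can rerun the same backtracking scheme on f_\gamma (a sketch; \gamma = 10 is an arbitrary illustrative choice):

    import numpy as np
    np.seterr(over="ignore")  # overflowing trial steps evaluate to inf and are simply rejected

    gamma = 10.0
    f = lambda x: np.exp(0.5 * (gamma**2 * x[0]**2 + x[1]**2))
    grad = lambda x: f(x) * np.array([gamma**2 * x[0], x[1]])

    x = np.array([1.0, 1.0])
    for k in range(200000):
        g = grad(x)
        if np.linalg.norm(g) < 1e-8:
            break
        t = 1.0
        while f(x - t * g) > f(x) - 0.25 * t * (g @ g):
            t *= 0.5
        x = x - t * g
    print(k)  # thousands of iterations, versus a handful when gamma = 1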














    Steepest Descent Gradient Descent as Steepest Descent:
    Recall: if t>0 is small and \Vert \nu \Vert_2=1, then Taylor’s approximation gives

    f(x+t\nu) - f(x) \approx t \nabla f(x)^T \nu.

    If \nu is a descent direction, then

    t \nabla f(x)^T \nu<0

    and this quantity records the approximate decrease in f in the direction \nu upon taking the small step

    x=x + t \nu.









    If we want to decrease f efficiently, we ask: for given step size t>0 , which direction \nu effects greatest descent?









    Observing

    \begin{aligned} -\frac{1}{\Vert u \Vert_2} u = \text{argmin}\{  u^T \nu : \Vert \nu \Vert_2 = 1 \} \end{aligned}

    for any nonzero u \in \mathbb{R}^n , we conclude

    \begin{aligned} -\frac{1}{\Vert \nabla f(x) \Vert_2} \nabla f(x) = \text{argmin}\{  \nabla f(x)^T \nu : \Vert \nu \Vert_2 = 1 \}. \end{aligned}

    Therefore, the direction

    \begin{aligned} \Delta x_{\text{nsd}}:=-\frac{1}{\Vert \nabla f(x) \Vert_2} \nabla f(x)  \end{aligned}

    gives the direction of greatest decrease: for small t>0 we have

    \begin{aligned} f(x + t \Delta x_{\text{nsd}}) - f(x) & \approx t \nabla f(x)^T \Delta x_{\text{nsd}}\\ &\leq t \nabla f(x)^T \nu\\ & \approx f(x + t \nu  ) - f(x) . \end{aligned}

    One may call such a direction a steepest descent direction.







    Quadratic Norm Steepest Descent
    Suppose for \alpha near p^\star, the sublevel sets S_\alpha are poorly conditioned.
    Suppose there is a change of variable/coordinates \bar x = P^{1/2} x so that

    \bar f(\bar x) := f(P^{-1/2}\bar x) = f(x)

    has well-conditioned sublevel sets.
    Then gradient descent in \bar x-coordinates is likely to behave well when minimizing \bar f(\bar x).








    Compute

    \nabla_{\bar x} \bar f(\bar x) = P^{-1/2} (\nabla f)(P^{-1/2}\bar x) = P^{-1/2} \nabla f(x).

    Then the steepest descent direction in \bar x -coordinates is

    \Delta \bar x = - \frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1/2} \nabla f(x).

    Converting back to original coordinates x = P^{-1/2}\bar x:

    \begin{aligned} \Delta x &= P^{-1/2}\Delta \bar x = - \frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1} \nabla f(x).  \end{aligned}









    But then (in x-coordinates)

    \begin{aligned} -\frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1} \nabla f(x) &= P^{-1/2}\left(- \frac{1}{\Vert  \nabla_{\bar x} \bar f(\bar x)\Vert_2}  \nabla \bar f(\bar x)\right)\\ &= P^{-1/2}\,\text{argmin}\left\{ \nabla \bar f(\bar x)^T \bar\nu : \Vert \bar \nu \Vert_2=1\right\}\\ &= \text{argmin}\left\{ \left( P^{-1/2} \nabla f(x) \right)^T P^{1/2}\nu : \Vert P^{1/2} \nu \Vert_2=1\right\}\\ &= \text{argmin}\left\{  \nabla f(x)^T \nu : \Vert P^{1/2} \nu \Vert_2=1\right\}, \end{aligned}

    where the third equality substitutes \bar \nu = P^{1/2}\nu .

    Therefore, the gradient descent direction obtained by the change of variable is the “steepest descent direction”

    \begin{aligned} \text{argmin}\left\{  \nabla f(x)^T \nu : \Vert P^{1/2} \nu \Vert_2=1\right\} \end{aligned}

    relative to the norm

    \Vert x \Vert_P := \Vert P^{1/2} x \Vert_2.









    In summary:
    • After a change of variable, the problem becomes well-conditioned and gradient descent may be used to obtain a steepest descent direction relative to the norm \Vert \cdot \Vert_2 (in the \bar x -coordinates).
    • Undoing the change of variable, this steepest descent direction is realized as the steepest descent relative to a different norm.









    The observations above suggest that a change of variable may result in better computational performance (via improving conditioning).
    Also indicated: gradient descent in the new variables was equivalent to finding the steepest descent direction with respect to the norm \Vert P^{1/2} \cdot \Vert_2 in the original variables.
    Motivated by this: consider steepest descent with respect to general norms.







    Review on Norms
    A norm on a vector space V is a function \Vert \cdot \Vert: V \to \mathbb{R}_{+} satisfying
    1. triangle inequality: \Vert x+y \Vert \leq \Vert x \Vert + \Vert y \Vert for all x,y \in V .
    2. Homogeneity: \Vert c x \Vert = |c| \Vert x \Vert for all c \in \mathbb{R} and x \in V .
    3. Positive definiteness: if x \in V satisfies \Vert x \Vert = 0, then x = 0 .








    Examples of important norms are:
    1. Standard Euclidean norm:

      \Vert x \Vert_2 = \sqrt{x_1^2 + \cdots + x_n^2}

    2. Quadratic norms for P \in \boldsymbol{S}_{++}^n:

      \Vert x \Vert_P = \Vert P^{1/2} x \Vert_2.

    3. \ell_p norms for 1 \leq p < \infty :

      \Vert x \Vert_p = \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}.

    4. Chebyshev/\ell_\infty norm:

      \Vert x \Vert_\infty = \max\{ |x_1|,\ldots,|x_n| \}.









    Norm spheres: for a given norm \Vert \cdot \Vert , the unit norm sphere

    \{ x : \Vert x \Vert = 1 \}

    defines a collection of “directions” relative to the norm \Vert \cdot \Vert .







    Examples of unit norm spheres are indicated below.








    Given a norm \Vert \cdot \Vert , we define the dual norm \Vert \cdot \Vert_* to be

    \Vert x \Vert_* := \sup\{ x^T \nu : \Vert \nu \Vert = 1 \} .

    (This is the “operator/matrix norm” of x^T .)

    Important examples are recorded in the following table of primal–dual pairs.

    • \Vert \cdot \Vert_2 has dual \Vert \cdot \Vert_2
    • \Vert \cdot \Vert_P := \Vert P^{1/2} \cdot \Vert_2 (for P \in \boldsymbol{S}_{++}^n ) has dual \Vert \cdot \Vert_{P^{-1}} := \Vert P^{-1/2} \cdot \Vert_2
    • \Vert \cdot \Vert_1 has dual \Vert \cdot \Vert_\infty
    • \Vert \cdot \Vert_\infty has dual \Vert \cdot \Vert_1
    • \Vert \cdot \Vert_p (for 1 < p < \infty ) has dual \Vert \cdot \Vert_q , where \frac{1}{p}+\frac{1}{q} = 1
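    As a quick numerical check of the \ell_1/\ell_\infty pairing (a sketch; the supremum over the \ell_1-sphere is attained at a signed standard basis vector \pm e_i ):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=5)

    # sup{ x^T v : ||v||_1 = 1 } is attained at v = +/- e_i, so it equals ||x||_inf
    basis = np.eye(5)
    candidates = np.vstack([basis, -basis])
    dual_via_sup = max(c @ x for c in candidates)
    print(dual_via_sup, np.linalg.norm(x, np.inf))  # the two values agree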









    General Steepest Descent
    Fix a norm \Vert \cdot \Vert on \mathbb{R}^n.

    Normalized steepest descent direction for \Vert \cdot \Vert : any element

    \Delta x_{\text{nsd}} \in \text{argmin}\{ \nabla f(x)^T \nu : \Vert \nu \Vert = 1 \}.



    Intuition: the step x=x+t\Delta x_{\text{nsd}} (for small t>0) effects the greatest decrease in the objective among all directions \nu satisfying \Vert \nu \Vert =1 .
    Recall:

    f(x+t\nu) - f(x) \approx t \nabla f(x)^T \nu.









    In practice:
    1. \Vert \cdot \Vert is chosen depending on the problem; however, the choices corresponding to gradient descent and to Newton’s method (given below) are common.
    2. Convergence results may depend on choice of \Vert \cdot \Vert .
    3. \Delta x_{\text{nsd}} is generally not unique; e.g. this can occur for \Vert \cdot \Vert_1 , as indicated below.
      Here, the red “x” indicates -\nabla f(x) and the black dots indicate two distinct \Delta x_{\text{nsd}} .
      In fact: \Delta x_{\text{nsd}} may be taken to be any point on the unit norm sphere in the first quadrant.








    N.B.: by

    \Delta x_{\text{nsd}} \in \text{argmin}\{\nabla f(x)^T \nu : \Vert \nu \Vert =1\}

    we have

    \begin{aligned} \nabla f(x)^T \Delta x_{\text{nsd}} &= \inf\{ \nabla f(x)^T \nu : \Vert \nu \Vert = 1 \} \\ &= -\sup\{ -\nabla f(x)^T \nu : \Vert \nu \Vert = 1 \} \\ &= -\Vert \nabla f(x)\Vert_*. \end{aligned}

    Recall:

    \Vert x \Vert_* = \sup\{ x^T \nu : \Vert \nu \Vert = 1\}.









    Unnormalized steepest descent direction: for any \Delta x_{\text{nsd}} , a descent direction of the form

    \Delta x_{\text{sd}} = \Vert \nabla f(x) \Vert_* \Delta x_{\text{nsd}}.



    N.B.:
    1. This choice of “unnormalization” gives

      \nabla f(x)^T \Delta x_{\text{sd}} =-\Vert \nabla f(x) \Vert_*^2.

    2. Using \Delta x_{\text{sd}} instead of \Delta x_{\text{nsd}} uses both the direction of steepest descent (with respect to \Vert \cdot \Vert ) and the rate of decrease of f.
    3. Exact line search does not see choice of normalization.
      Indeed for c>0, we have

      \begin{aligned} \text{argmin}\{f(x+tc\Delta x): t \geq 0 \} = \frac{1}{c}\,\text{argmin}\{f(x+t\Delta x): t \geq 0 \}  \end{aligned}

    4. Theoretically: choice of normalization does not matter.
      Pragmatically: choice of normalization may affect behavior of convergence.








    Steepest Descent Method
    The general steepest descent method may now be recorded.
    
    given initial x \in \text{dom}\,f 
    repeat:
    1. Compute: \Delta x_{\text{sd}} .
    2. Perform line search to determine step size t .
    3. Take step: x:= x + t\Delta x_{\text{sd}} .
    until: stopping criterion holds.
    
    N.B.: for exact line search, either \Delta x_{\text{sd}} or \Delta x_{\text{nsd}} may be used.
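    The method can be written generically by passing the map g \mapsto \Delta x_{\text{sd}} as an argument (a sketch; the function and parameter names are our own):

    import numpy as np

    def steepest_descent(f, grad, sd_direction, x, alpha=0.25, beta=0.5,
                         tol=1e-8, max_iter=10000):
        # generic steepest descent with backtracking line search;
        # sd_direction(g) returns the (unnormalized) steepest descent
        # direction for the chosen norm, given the gradient g
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:  # simple stopping criterion
                break
            dx = sd_direction(g)
            t = 1.0
            while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
                t *= beta
            x = x + t * dx
        return x

    # Euclidean norm: sd_direction(g) = -g recovers gradient descent
    print(steepest_descent(lambda x: x @ x, lambda x: 2 * x,
                           lambda g: -g, np.array([3.0, -4.0])))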
















    Examples For various norms, we will compute

    \begin{aligned} &\Delta x_{\text{nsd}}  \\ &\Delta x_{\text{sd}} = \Vert \nabla f(x) \Vert_* \Delta x_{\text{nsd}}. \end{aligned}









    Example 1
    Let \Vert \cdot \Vert = \Vert \cdot \Vert_2 .
    Observe

    \begin{aligned} \Delta x_{\text{nsd}} &=\text{argmin}\{ \nabla f(x)^T\nu : \Vert \nu \Vert_2 = 1 \}\\ &= -\frac{1}{\Vert \nabla f(x) \Vert_2} \nabla f(x). \end{aligned}









    Next, using

    \Vert \cdot \Vert_* = \Vert \cdot \Vert_2

    we have

    \Vert \nabla f(x) \Vert_* = \Vert \nabla f(x) \Vert_2

    and so

    \begin{aligned} \Delta x_{\text{sd}} &= \Vert \nabla f(x) \Vert_* \Delta x_{\text{nsd}}\\ &=\Vert \nabla f(x) \Vert_2 \left( -\frac{1}{\Vert \nabla f(x) \Vert_2} \nabla f(x) \right) \\ &=- \nabla f(x). \end{aligned}


    Conclusion: gradient descent method is the steepest descent with respect to the Euclidean norm \Vert \cdot \Vert_2.







    Example 2
    Let P \in \boldsymbol{S}_{++}^n and let

    \Vert x \Vert = \Vert x \Vert_P := \Vert P^{1/2} x\Vert_2 .

    Compute

    \begin{aligned} \Delta x_{\text{nsd}} &= \text{argmin}\left\{  \nabla f(x)^T \nu : \Vert P^{1/2} \nu \Vert_2=1\right\}\\ &= P^{-1/2} \text{argmin}\left\{  \nabla f(x)^T P^{-1/2}\mu : \Vert \mu \Vert_2=1\right\}\\ &= P^{-1/2} \text{argmin}\left\{  (P^{-1/2}\nabla f(x))^T\mu : \Vert \mu \Vert_2=1\right\}\\ &= P^{-1/2}\left( - \frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1/2} \nabla f(x)  \right)\\ &=  - \frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1} \nabla f(x). \end{aligned}









    Using that the dual of \Vert \cdot \Vert_P is

    \Vert x \Vert_* = \Vert x \Vert_{P^{-1}} := \Vert P^{-1/2} x\Vert_2

    and so

    \Vert \nabla f(x) \Vert_* = \Vert P^{-1/2} \nabla f(x) \Vert_2 ,

    we conclude

    \begin{aligned} \Delta x_{\text{sd}} &= \Vert \nabla f(x) \Vert_* \Delta x_{\text{nsd}}\\ &= \Vert P^{-1/2} \nabla f(x) \Vert_2 \left( - \frac{1}{\Vert  P^{-1/2} \nabla f(x)\Vert_2} P^{-1} \nabla f(x) \right)\\ & = P^{-1}\nabla f(x)  \end{aligned} .









    Consider the change of variable y = P^{1/2} x and

    \bar f(y) = f(P^{-1/2}y) = f(x) .

    The gradient descent step for \bar f in the new variables y is

    \begin{aligned} \Delta y_{\text{sd}} &= - \nabla \bar f(y) \\ &= -\nabla_y (f(P^{-1/2}y))\\ &= - P^{-1/2} (\nabla f)(P^{-1/2}y) \end{aligned} .

    This direction in the original variables x is thus

    \begin{aligned} P^{-1/2}\Delta y_{\text{sd}} &= - P^{-1} \nabla f(x) = \Delta x_{\text{sd}}. \end{aligned} .









    Conclusion: obtaining \Delta x_{\text{sd}} with respect to \Vert \cdot \Vert_P is equivalent to standard gradient descent in a different coordinate system.
    N.B.: suitably chosen P may result in a well-conditioned problem in the new variables y = P^{1/2} x.
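    A sketch of this in action on the ill-conditioned f_\gamma from before: with P = \text{diag}(\gamma^2, 1) (the Hessian of f_\gamma at the minimizer, our choice for illustration), the direction \Delta x_{\text{sd}} = -P^{-1}\nabla f_\gamma(x) removes the anisotropy that stalled plain gradient descent.

    import numpy as np
    np.seterr(over="ignore")  # overflowing trial steps evaluate to inf and get rejected

    gamma = 10.0
    f = lambda x: np.exp(0.5 * (gamma**2 * x[0]**2 + x[1]**2))
    grad = lambda x: f(x) * np.array([gamma**2 * x[0], x[1]])
    P_inv = np.diag([1 / gamma**2, 1.0])   # P = diag(gamma^2, 1)

    x = np.array([1.0, 1.0])
    for k in range(100):
        g = grad(x)
        if np.linalg.norm(g) < 1e-8:
            break
        dx = -P_inv @ g                    # steepest descent step for the P-norm
        t = 1.0
        while f(x + t * dx) > f(x) + 0.25 * t * (g @ dx):
            t *= 0.5
        x = x + t * dx
    print(x, k)  # converges in tens of iterations, unlike plain gradient descent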







    Example 3
    Let

    \Vert x \Vert = \Vert x \Vert_1 = |x_1| + \cdots + |x_n|,

    and recall

    \Vert x \Vert_* = \Vert x \Vert_\infty = \max\{ |x_1|,\ldots,|x_n|\}.

    Given x , let i be such that

    \begin{aligned} \Vert \nabla f(x) \Vert_\infty &= \max\left\{\left\vert \frac{\partial f(x)}{\partial x_j} \right\vert: j=1,\ldots,n\right\}\\ &= \left\vert \frac{\partial f(x)}{\partial x_i} \right\vert \end{aligned}









    Let \left\{e_j\right\} denote the standard basis of \mathbb{R}^n.
    N.B.: \Vert \pm e_i \Vert_1 = 1.
    One can show

    \begin{aligned} \Delta x_{\text{nsd}} &= - \text{sgn}\left( \frac{\partial f(x)}{\partial x_i} \right) e_i. \end{aligned}

    Therefore, for \Vert \cdot \Vert_1 , the steepest descent direction \Delta x_{\text{nsd}} is in the coordinate direction in which f changes the most.







    Lastly, one has

    \begin{aligned} \Delta x_{\text{sd}} &= \Vert \nabla f(x) \Vert_\infty \Delta x_{\text{nsd}}\\ &= \left\vert \frac{\partial f(x)}{\partial x_i} \right\vert \left(- \text{sgn}\left( \frac{\partial f(x)}{\partial x_i} \right)  e_i \right)\\ &=-\frac{\partial f(x)}{\partial x_i}  e_i \end{aligned}

    The resulting steepest descent algorithm is often called a coordinate-descent algorithm.
    Indeed: the step

    x=x-t\frac{\partial f(x)}{\partial x_i}  e_i, \quad t>0

    amounts to increasing or decreasing the i th coordinate of x according to the coordinate direction of greatest change.
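    A sketch of the resulting coordinate-descent step (the example function and the fixed step size are our own illustrative choices):

    import numpy as np

    def l1_sd_step(g):
        # unnormalized steepest descent step for the l1 norm:
        # move along the coordinate with the largest partial derivative
        i = np.argmax(np.abs(g))
        dx = np.zeros_like(g)
        dx[i] = -g[i]                      # Delta x_sd = -(df/dx_i) e_i
        return dx

    grad = lambda x: np.array([2 * x[0], 20 * x[1]])   # f(x) = x1^2 + 10 x2^2
    x = np.array([1.0, 1.0])
    for _ in range(100):
        x = x + 0.04 * l1_sd_step(grad(x))             # fixed step size, for simplicity
    print(x)   # approaches the minimizer (0, 0)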














    Newton’s Method Suppose always that f is strongly convex and twice continuously differentiable.

    Newton step: the direction

    \Delta x_{\text{nt}} = - \nabla^2 f(x)^{-1} \nabla f(x).

    N.B.: if \nabla f(x) \neq 0 , then convexity implies

    \begin{aligned} \nabla f(x)^T \Delta x_{\text{nt}} &= - \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)<0, \end{aligned}

    and so \Delta x_{\text{nt}} is a descent direction.
    (Recall strong convexity \implies \nabla^2 f(x) \in \boldsymbol{S}_{++}^n .)

    N.B.: Computing \Delta x_{\text{nt}} involves solving the system

    \begin{aligned} Hv &= -g,\\ H = \nabla^2 f(x), &\quad g = \nabla f(x). \end{aligned}

    Exploiting special matrix structure of H (e.g., sparsity) may allow the system to be solved efficiently.
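    In code, the step is a linear solve, never an explicit inverse; since H is symmetric positive definite, a Cholesky factorization is the natural choice (a sketch using scipy's cho_factor/cho_solve on arbitrary test data):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def newton_step(H, g):
        # solve H v = -g via Cholesky (H symmetric positive definite)
        return cho_solve(cho_factor(H), -g)

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 4))
    H = A @ A.T + 4 * np.eye(4)            # a random SPD "Hessian"
    g = rng.normal(size=4)
    v = newton_step(H, g)
    print(np.allclose(H @ v, -g))          # True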







    Example.
    Let

    f(x_1,x_2) = x_1^2 + e^{x_2}.

    Then

    \nabla f(x_1,x_2) =  \begin{bmatrix} 2x_1\\ e^{x_2} \end{bmatrix}, \qquad \nabla^2 f(x_1,x_2) = \begin{bmatrix} 2 & 0\\ 0 & e^{x_2} \end{bmatrix},

    and so

    \begin{aligned} \Delta x_{\text{nt}} &= - \nabla^2 f(x_1,x_2)^{-1} \nabla f(x_1,x_2)\\ &=- \begin{bmatrix} \frac{1}{2} & 0\\ 0 & e^{-x_2} \end{bmatrix} \begin{bmatrix} 2x_1\\ e^{x_2} \end{bmatrix} \\ &= \begin{bmatrix} -x_1\\ -1 \end{bmatrix} \end{aligned}









    Observe:
    • (x_1,x_2) + \Delta x_{\text{nt}} = (x_1,x_2) - (x_1,1) = (0,x_2-1) .
      Thus, first step minimizes the quadratic part x_1^2 of f:

      f(0,x_2-1) = e^{x_2-1}.

    • x^{(k)} = (0,x_2-k) for k>0 .
      Thus, sequence decreases the e^{x_2} part of f:

      f(x^{(k)}) = e^{x_2-k}.

    A couple of steps are plotted below.
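    The iterates can equally be checked in code (a small sketch with full Newton steps, t = 1):

    import numpy as np

    grad = lambda x: np.array([2 * x[0], np.exp(x[1])])
    hess_inv = lambda x: np.diag([0.5, np.exp(-x[1])])   # inverse of diag(2, e^{x2})

    x = np.array([2.0, 3.0])
    for _ in range(4):
        x = x - hess_inv(x) @ grad(x)      # x + Delta x_nt
        print(x)                           # (0, 2), (0, 1), (0, 0), (0, -1)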








    Interpretations of Newton Step
    Steepest Descent:
    Let

    P = \nabla^2 f(x).

    Consider the “Hessian norm”:

    \begin{aligned} \Vert y \Vert &:= \Vert y \Vert_P \\ &= \Vert P^{1/2} y \Vert_2\\ &= \left( (P^{1/2}y)^T P^{1/2}y \right)^{1/2}\\ &= \left( y^T P y \right)^{1/2}\\ &= \left( y^T \nabla^2 f(x) y \right)^{1/2}. \end{aligned}









    Recall: unnormalized steepest descent direction \Delta x_{\text{sd}} for quadratic norm \Vert \cdot\Vert_Q is

    \Delta x_{\text{sd}} = - Q^{-1}\nabla f(x) .

    Therefore, \Delta x_{\text{nt}} is the steepest descent direction \Delta x_{\text{sd}} for \Vert \cdot \Vert_{P} :

    \Delta x_{\text{nt}} = - \nabla^2 f(x)^{-1} \nabla f(x) = - P^{-1} \nabla f(x) .



    N.B.: finding \Delta x_{\text{nt}} is equivalent to finding \Delta x_{\text{sd}} after a change of variable that results in well-conditioned sublevel sets near the optimizer.







    Minimizing Quadratic Approximation:
    For given x , define

    F(v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v.

    Thus, F(v) \approx f(x+v) is the second order Taylor approximation of f at x .
    N.B.: F(v)  is a convex quadratic in v since \nabla^2 f(x) \in \boldsymbol{S}_{++}^n .
    Moreover, recalling

    \begin{aligned} P \in \boldsymbol{S}_{++}^n \implies \text{argmin}\{c + b^T v + \frac{1}{2} v^T P v\} =  -P^{-1}b \end{aligned}

    we conclude F(v) is minimized at

    v = - \nabla^2 f(x)^{-1} \nabla f(x) = \Delta x_{\text{nt}}.

    Viz., the step \Delta x_{\text{nt}} exactly minimizes the quadratic approximation.







    Conclusion:
    Given
    • \Delta x_{\text{nt}} minimizes the quadratic approximation F of f at x.
    • f(x+v) \approx F(v) is a good approximation for x \approx x^\star .
    we conclude
    • p^\star = f(x^\star) \approx F(\Delta x_{\text{nt}}) and x+\Delta x_{\text{nt}} \approx x^\star .
    • This conclusion is depicted below.








      The Newton Decrement
      The Newton decrement at x is the quantity

      \lambda(x) = \left( \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) \right)^{1/2}.

      Detailed below: \lambda(x) is used to estimate

      \begin{aligned} &f(x) - \inf_{v} F(v)\\ &f(x) - p^\star\\ &f(x+t\Delta x_{\text{nt}}) - f(x). \end{aligned}









      Proposition. There holds

      \begin{aligned} \lambda(x) &= \Vert \Delta x_{\text{nt}} \Vert_{\nabla^2 f(x)}. \end{aligned}

      Proof.

      \begin{aligned} \lambda(x) &= \left( \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) \right)^{1/2}\\ &= \left( \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla^2 f(x) \nabla^2 f(x)^{-1} \nabla f(x) \right)^{1/2}\\ &=\left( (-\nabla^2 f(x)^{-1}\nabla f(x))^T  \nabla^2 f(x) (-\nabla^2 f(x)^{-1} \nabla f(x) )\right)^{1/2}\\ &=\left(\Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}} \right)^{1/2}\\ &= \Vert \Delta x_{\text{nt}} \Vert_{\nabla^2 f(x)}. \end{aligned}









      Proposition. There holds

      \begin{aligned} f(x) - \inf_v F(v) &= \frac{1}{2}\lambda(x)^2. \end{aligned}

      Therefore, using

      \begin{aligned} x \approx x^\star &\implies f(x+v) \approx F(v)\\ & \implies F(\Delta x_{\text{nt}}) \approx p^\star \end{aligned}

      we conclude that \frac{1}{2}\lambda^2(x) \approx f(x)-p^\star.
      Proof.

      \begin{aligned} f(x) - \inf F(v) &= f(x) - F(\Delta x_{\text{nt}})\\ &= f(x) - f(x) - \nabla f(x)^T \Delta x_{\text{nt}} - \frac{1}{2}\Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}}\\ &=  \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) \\ &\qquad -\frac{1}{2}  \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla^2 f(x) \nabla^2 f(x)^{-1} \nabla f(x)\\ &=  \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) \\ &\qquad  - \frac{1}{2}  \nabla f(x)^T \nabla^2 f(x)^{-1}  \nabla f(x)\\ &=  \frac{1}{2}\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\\ &= \frac{1}{2}\lambda(x)^2 \end{aligned}









      Proposition. Armijo’s condition for the Newton step is

      f(x+t\Delta x_{\text{nt}}) - f(x) \leq  -\alpha t  \lambda(x)^2

      Proof. Recall: Armijo’s condition for backtracking line search is

      f(x+t\Delta x) - f(x) \leq  \alpha t\nabla f(x)^T \Delta x.

      Taking

      \Delta x = \Delta x_{\text{nt}} = - \nabla^2 f(x)^{-1} \nabla f(x),

      we find

      \begin{aligned} \nabla f(x)^T \Delta x_{\text{nt}} &= \nabla f(x)^T \left( - \nabla^2 f(x)^{-1} \nabla f(x) \right)\\ &= -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\\ &= - \lambda(x)^2. \end{aligned}

      Thus Armijo’s condition for the Newton step is

      f(x+t\Delta x_{\text{nt}}) - f(x) \leq  -\alpha t  \lambda(x)^2









      Newton’s Method
      Using the Newton step and decrement, we may now record Newton’s method.
      Traditionally, “Newton’s method” uses step size t=1.
      The following is thus sometimes called a “damped Newton method.”

      
      given 
               initial x \in \text{dom}\,f 
               tolerance \epsilon>0
      repeat: 
      1. Compute 
               \Delta x_{\text{nt}} = - \nabla^2 f(x)^{-1}\nabla f(x) 
      and 
               \lambda^2 = \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) .
      2. Stopping criterion: quit if \frac{1}{2}\lambda^2 \leq \epsilon .
      3. Perform line search to determine step size t .
      4. Take step: x = x + t\Delta x_{\text{nt}} .
      


      N.B.: Stopping criterion comes from

      \begin{aligned} x \approx x^\star &\implies f(x+v) \approx F(v)\\ & \implies F(\Delta x_{\text{nt}}) \approx p^\star \end{aligned}

      which implies \frac{1}{2}\lambda^2(x) \approx f(x)-p^\star.
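      A sketch of the damped Newton method as recorded above (the test function is our own illustrative choice):

      import numpy as np

      def damped_newton(f, grad, hess, x, eps=1e-10, alpha=0.25, beta=0.5):
          while True:
              g, H = grad(x), hess(x)
              dx = np.linalg.solve(H, -g)       # Newton step
              lam2 = -(g @ dx)                  # lambda(x)^2 = g^T H^{-1} g
              if lam2 / 2 <= eps:               # since f(x) - p* is approx lambda^2 / 2
                  return x
              t = 1.0
              while f(x + t * dx) > f(x) - alpha * t * lam2:   # Armijo for the Newton step
                  t *= beta
              x = x + t * dx

      # strongly convex test function
      f = lambda x: np.exp(x[0] + x[1]) + np.exp(x[0] - x[1]) + x[0]**2
      grad = lambda x: np.array([np.exp(x[0] + x[1]) + np.exp(x[0] - x[1]) + 2 * x[0],
                                 np.exp(x[0] + x[1]) - np.exp(x[0] - x[1])])
      def hess(x):
          a, b = np.exp(x[0] + x[1]), np.exp(x[0] - x[1])
          return np.array([[a + b + 2.0, a - b], [a - b, a + b]])

      print(damped_newton(f, grad, hess, np.array([1.0, 1.0])))  # approx (-0.567, 0)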







      Affine Invariance
      Proposition. If
      • T \in \mathbb{R}^{n\times n} is nonsingular
      • x = Ty
      • \bar f(y):= f(Ty)
      then the Newton step \Delta y_{\text{nt}} for \bar f is related to the Newton step \Delta x_{\text{nt}} for f by

      \Delta x_{\text{nt}} = T \Delta y_{\text{nt}} .

      Proof. Computing

      \begin{aligned} \nabla \bar f(y) &= T^T \nabla f(x)\\ \nabla^2 \bar f(y) &= T^T \nabla^2 f(x) T, \end{aligned}

      we find

      \begin{aligned} \Delta y_{\text{nt}} &= - \nabla^2 \bar f(y)^{-1} \nabla \bar f(y)\\ &=-\left( T^T \nabla^2 f(x) T \right)^{-1} T^T \nabla f(x)\\ &=-T^{-1} \nabla^2 f(x)^{-1}(T^T)^{-1}T^T \nabla f(x)\\ &=-T^{-1}\nabla^2 f(x)^{-1}\nabla f(x)\\ &=T^{-1}\Delta x_{\text{nt}}. \end{aligned}









      Remark 1. This proposition asserts that an affine change of variable effects the same affine change in Newton steps.
      In particular:

      x+\Delta x_{\text{nt}} = Ty + T\Delta y_{\text{nt}} = T(y + \Delta y_{\text{nt}}).

      Thus, if minimizing f via Newton’s method gives iterates x^{(k)}, then minimizing \bar f via Newton’s method with y^{(0)} = T^{-1} x^{(0)} gives y^{(k)} = T^{-1} x^{(k)}.
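      A numerical check of the proposition (a sketch with an arbitrary strongly convex f and an arbitrary nonsingular T of our choosing):

      import numpy as np

      rng = np.random.default_rng(0)
      T = rng.normal(size=(2, 2)) + 3 * np.eye(2)    # nonsingular for this seed

      # f(x) = exp(x1) + exp(x2) + ||x||_2^2
      grad = lambda x: np.exp(x) + 2 * x
      hess = lambda x: np.diag(np.exp(x)) + 2 * np.eye(2)

      x = np.array([0.7, -0.4])
      dx = np.linalg.solve(hess(x), -grad(x))        # Newton step for f at x

      gbar = T.T @ grad(x)                           # gradient of f(Ty) at y = T^{-1} x
      Hbar = T.T @ hess(x) @ T                       # Hessian of f(Ty) at y
      dy = np.linalg.solve(Hbar, -gbar)              # Newton step for f-bar at y

      print(np.allclose(dx, T @ dy))                 # True: Delta x_nt = T Delta y_nt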







      Remark 2. The Newton decrement is also invariant under the affine change x=Ty .

      \begin{aligned} \bar\lambda(y)^2&= \nabla \bar f(y)^T \nabla^2 \bar f(y)^{-1} \nabla \bar f(y)\\ &= (\nabla f(x)^T T )( T^{-1} \nabla^2 f(x)^{-1} T^{-T} ) T^T \nabla f(x)\\ &= \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\\ &=\lambda(x)^2. \end{aligned}

      In particular: the stopping criterion for Newton’s method remains the same after an affine change of variables.














    Descent Algorithms for Equality Constrained Minimization
    Overview Equality Constrained Minimization.
    We will focus on equality constrained problems of the form

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax = b. \end{cases}

    The main assumptions are
    • f is convex.
    • f is twice continuously differentiable: \nabla^2 f is continuous.
    • An optimal solution x^\star \in \text{dom}\,f exists.
    • A \in \mathbb{R}^{p \times n} with \text{rank}\,A = p < n.
    • b \in \mathbb{R}^p is problem data.
    N.B.:
    1. \text{rank}\,A = p indicates equality constraints are independent (not superfluous).
    2. \text{rank}\, A< n indicates there are fewer equality constraints than variables.








    Goal.
    Demonstrate that Newton’s method naturally extends to equality constrained problems.
    Will cover:
    • Feasible start equality constrained Newton’s method
    • Infeasible start equality constrained Newton’s method








    Warm-up Question.
    Suppose we wish to employ Newton’s method for

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax = b. \end{cases}

    Suppose \Delta x_{\text{nt}} denotes the to-be-determined Newton step.
    Suppose starting point x^{(0)} \in \text{dom}\,f is feasible: Ax^{(0)} = b.
    What necessary assumption on \Delta x_{\text{nt}} should we impose so that each step x = x + t \Delta x_{\text{nt}} remains feasible?














    Equivalent Unconstrained Formulations Main Idea.
    We may apply unconstrained optimization techniques to unconstrained problems that are equivalent to the equality constrained problem

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax = b. \end{cases}

    Two ways of achieving this are
    1. eliminating equality constraint by the change

      f(x) \mapsto f(Fz+x_0).

    2. formulating the Lagrangian dual problem.
    N.B.: both reformulations of the equality constrained problem may break problem structure (e.g., sparsity) and hence affect numerics.







    Equality Constraint Elimination.
    N.B.: \text{rank}\,A = p < n \implies \dim\ker A = n-p .
    Can find F \in \mathbb{R}^{n \times (n-p)} such that

    \text{range}\,F = \text{ker}\,A.

    Let x_0 solve Ax_0 = b.
    Then

    \{ x \in \mathbb{R}^n : Ax = b \} = \{ Fz + x_0 : z \in \mathbb{R}^{n-p}\}.

    Therefore

    \{ f(x) : Ax = b \} = \{ f(Fz + x_0) : z \in \mathbb{R}^{n-p}\}

    and so

    \begin{aligned} \inf\{f(x): Ax=b\} = \inf\{ f(Fz + x_0) : z \in \mathbb{R}^{n-p}\}. \end{aligned}









    Thus, if z^\star \in \mathbb{R}^{n-p} solves

    \begin{cases} \text{minimize} & f(Fz+x_0) \end{cases}

    then x^\star = Fz^\star + x_0 solves

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax=b. \end{cases}

    May therefore use unconstrained optimization for f(Fz+x_0) to solve original equality constrained problem.
    N.B.: this confirms that minimizing the restriction of a function to an affine subspace is theoretically equivalent to an unconstrained problem.







    Using Lagrangian Dual.
    Since there are no inequality constraints, the Lagrangian of the problem is

    L(x,\nu) = f(x) + \nu^TAx - \nu^T b.

    Using Legendre transform f^* , the dual function is

    \begin{aligned} g(\nu) &= \inf\{ L(x,\nu) : x\}\\ &= \inf\{f(x) + \nu^TAx - \nu^T b\}\\ &= - \nu^T b - \sup\{ -(A^T \nu)^T x - f(x)\}\\ &=-\nu^T b - f^*(-A^T\nu). \end{aligned}









    The dual problem is therefore the unconstrained optimization problem

    \begin{cases} \text{maximize} & -\nu^T b - f^*(-A^T\nu). \end{cases}

    N.B.: g(\nu)=-\nu^T b - f^*(-A^T\nu) is not a priori twice continuously differentiable, even if f is.
    If g(\nu) is twice continuously differentiable, then can use unconstrained optimization to find dual optimizer \nu^\star and whence primal optimizer x^\star . (Assumption that x^\star exists implies problem satisfies Slater’s condition since there are no inequality constraints.)














    Quadratic Model Problem Recall:
    1. The quadratic function

      \begin{aligned} f(x) = \frac{1}{2}x^T Q x + q^T x + r, \\ \quad Q \in \boldsymbol{S}_{++}^n, \quad q \in \mathbb{R}^n, \quad r \in \mathbb{R}. \end{aligned}

      is minimized at

      x^\star = -Q^{-1}q.

      N.B.: at any x , the Newton step is \Delta x_{\text{nt}} = -\nabla^2 f(x)^{-1} \nabla f(x) = -x - Q^{-1}q , and so x + \Delta x_{\text{nt}} = x^\star : one full Newton step solves this problem.
    2. For general f , Newton’s method is to solve problem by solving sequence of quadratic approximation problems:

      \begin{aligned} \text{Perform Quad. approx. at }x &\implies \text{Solve Quad. approx. problem for }\Delta x_{\text{nt}}\\ &\implies \text{Take Newton step } x \mapsto x + t \Delta x_{\text{nt}}\\ & \implies \text{Perform Quad. approx. at } x + t\Delta x_{\text{nt}}\\ &\implies \cdots \end{aligned}









    Idea.
    Develop Newton-type method for equality constrained problems based on minimizing the quadratic model problem

    \begin{cases} \text{minimize} & \frac{1}{2}x^T Qx + q^Tx + r\\ \text{subject to} & Ax=b, \end{cases}

    with Q,q,r,A,b as above.
    Then one may follow unconstrained idea:

    \begin{aligned} \text{Perform Quad. approx. at }x &\implies \text{Solve Quad. approx. problem for }\Delta x_{\text{nt}}\\ &\implies \text{Take Newton step } x \mapsto x + t \Delta x_{\text{nt}}\\ & \implies \text{Perform Quad. approx. at } x + t\Delta x_{\text{nt}}\\ &\implies \cdots \end{aligned}

    N.B.: the more general case Q \in \boldsymbol{S}_+^n may also be treated.







    The Model.
    For

    \begin{cases} \text{minimize} & \frac{1}{2}x^T Qx + q^Tx + r\\ \text{subject to} & Ax=b, \end{cases}

    the KKT optimality conditions are

    \begin{cases} \begin{aligned} Ax^\star &= b\\ Qx^\star + q + A^T \nu^\star &=0. \end{aligned} \end{cases}

    The second equation follows from differentiating the Lagrangian

    L(x,\nu) =\frac{1}{2}x^T Qx + q^Tx + r + \nu^T Ax - \nu^T b.

    (N.B.: Slater’s condition is satisfied if Ax=b is consistent.)







    The KKT optimality conditions are equivalent to the KKT system:

    \begin{bmatrix} Q & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} x^\star \\ \nu^\star \end{bmatrix} =  \begin{bmatrix} -q\\ b \end{bmatrix}.

    KKT matrix:

    \begin{bmatrix} Q & A^T\\ A & 0  \end{bmatrix}.

    N.B.:
    • If KKT matrix is nonsingular, then problem has unique solution.
    • If KKT matrix is singular, then either Ax=b is inconsistent, or solutions to problem are not unique.








    Solving the KKT System.
    Suppose KKT matrix is nonsingular.
    Then x^\star is obtained by computing

    \begin{bmatrix} x^\star \\ \nu^\star \end{bmatrix} =  \begin{bmatrix} Q & A^T\\ A & 0  \end{bmatrix}^{-1} \begin{bmatrix} -q\\ b \end{bmatrix}.

    But this may be costly.
    Another idea: use KKT conditions directly.

    \begin{aligned} \begin{cases} \begin{aligned} Ax^\star &= b\\ Qx^\star + q + A^T \nu^\star &=0. \end{aligned} \end{cases} &\implies x^\star = -Q^{-1}(q + A^T \nu^\star)\\ & \implies b= A x^\star = -AQ^{-1}(q + A^T \nu^\star)\\ &\implies \nu^\star = - (AQ^{-1}A^T)^{-1}(b + AQ^{-1}q) \end{aligned}

    Now, computing \nu^\star can be used to compute x^\star.
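    Both routes are a few lines of numpy (a sketch on a small instance; the data Q, q, A, b are arbitrary):

    import numpy as np

    Q = np.array([[3.0, 1.0], [1.0, 2.0]])          # SPD
    q = np.array([1.0, -1.0])
    A = np.array([[1.0, 1.0]])                      # rank 1 < n = 2
    b = np.array([1.0])

    # route 1: solve the KKT system directly
    KKT = np.block([[Q, A.T], [A, np.zeros((1, 1))]])
    sol = np.linalg.solve(KKT, np.concatenate([-q, b]))
    x_kkt, nu_kkt = sol[:2], sol[2:]

    # route 2: eliminate x* and solve for nu* first
    Qinv = np.linalg.inv(Q)
    nu = -np.linalg.solve(A @ Qinv @ A.T, b + A @ Qinv @ q)
    x = -Qinv @ (q + A.T @ nu)

    print(np.allclose(x, x_kkt), np.allclose(A @ x, b))   # True True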







    Main Idea.
    Consider now the general problem

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax = b. \end{cases}

    Given initial x \in \text{dom}\,f satisfying Ax=b , let

    \begin{aligned} F(v) &= f(x) + \nabla f(x)^T v + \frac{1}{2}v^T \nabla^2 f(x) v &\approx f(x+v)  \end{aligned}

    be the quadratic approximation of f at x.
    N.B.: if we want to approximate f at x+v with x+v feasible, then need Av=0 :

    A(x+v) = Ax+Av = b + Av = b \iff Av = 0.









    Idea: if feasible x near x^\star , then F approximates f well and so, if v^\star solves

    \begin{cases} \text{minimize} & F(v) = f(x) + \nabla f(x)^T v + \frac{1}{2}v^T \nabla^2 f(x) v\\ \text{subject to} & Av = 0, \end{cases}

    then x+v^\star approximates x^\star well.
    N.B.: The point of Av^\star = 0 is so that x+v^\star remains feasible, i.e.,

    A(x+v^\star) = Ax + Av^\star = Ax = b.

    Viz., v^\star gives the feasible direction which minimizes the quadratic approximation of f .







    Linear Equality Constrained Newton’s Method Feasible directions relative to equality constraint Ax=b are any directions v \in \ker A .
    N.B.: if v is a feasible direction and x is feasible, then x+tv is feasible for all t \in \mathbb{R} . (Viz.: line searching does not break feasibility.)







    Consider now the general problem

    \begin{cases} \text{minimize} & f(x)\\ \text{subject to} & Ax = b. \end{cases}

    Newton step: a direction \Delta x_{\text{nt}} such that the system

    \begin{bmatrix} \nabla^2 f(x)  & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} \Delta x_{\text{nt}}\\ w \end{bmatrix} = \begin{bmatrix} -\nabla f(x)\\ 0 \end{bmatrix}

    is consistent for some w \in \mathbb{R}^p .
    N.B.: we only define \Delta x_{\text{nt}} when the KKT matrix

    \begin{bmatrix} \nabla^2 f(x)  & A^T\\ A & 0  \end{bmatrix}

    is nonsingular, which always holds when f is strongly convex.







    Remarks.
    1. The system

      \begin{bmatrix} \nabla^2 f(x)  & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} z_1\\z_2 \end{bmatrix} = \begin{bmatrix} -\nabla f(x)\\ 0 \end{bmatrix}

      is the KKT system for the quadratic approximation problem

      \begin{cases} \text{minimize} &  f(x) + \nabla f(x)^T v + \frac{1}{2}v^T \nabla^2 f(x) v\\ \text{subject to} & Av = 0. \end{cases}

      Thus, finding \Delta x_{\text{nt}} amounts to minimizing the equality constrained quadratic approximation of the original problem.








    2. \Delta x_{\text{nt}} is a feasible descent direction.
      Indeed, A\Delta x_{\text{nt}} = 0, and

      \nabla^2 f(x) \Delta x_{\text{nt}} + A^T w = - \nabla f(x)

      imply

      \Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}} + \Delta x_{\text{nt}}^T A^T w = - \nabla f(x)^T \Delta x_{\text{nt}},

      which implies

      0>-\Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}}  = \nabla f(x)^T \Delta x_{\text{nt}}.









    Newton Decrement.
    The Newton decrement \lambda(x) for the linear constrained Newton step is

    \lambda(x) = \left( \Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}} \right)^{1/2}.

    The interpretations for the unconstrained Newton decrement also hold; e.g.,

    f(x) - p^\star \approx \frac{1}{2}\lambda(x)^2 for x \approx x^\star .

    In particular: \frac{1}{2}\lambda(x)^2 < \epsilon gives a suitable stopping criterion as before.







    Linear Equality Constrained Newton’s Method.
    The Newton’s method for the linear equality constrained problem may now be stated.
    One may call the following algorithm a feasible descent method since each iteration demands the update x = x + t \Delta x_{\text{nt}} is feasible.
    
    given 
             initial x \in \text{dom}\,f  with Ax=b 
             tolerance \epsilon>0 
    repeat: 
    1. Solve
             \begin{bmatrix} \nabla^2 f(x)  & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} \Delta x_{\text{nt}}\\ w \end{bmatrix} = \begin{bmatrix} -\nabla f(x)\\ 0 \end{bmatrix} 
    and compute
             \lambda^2 = \Delta x_{\text{nt}}^T \nabla^2 f(x) \Delta x_{\text{nt}} .
    2. Stopping criterion: quit if \frac{1}{2}\lambda^2 \leq \epsilon .
    3. Perform line search to determine step size t .
    4. Take step: x = x + t\Delta x_{\text{nt}} .
    
    Interpretation.
    The equality constrained Newton’s method amounts to constructing a sequence of equality constrained quadratic minimization problems whose solutions approximate the solution of the original problem.
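    A sketch of the feasible Newton method above, applied to the entropy problem minimize \sum_i x_i \log x_i subject to \sum_i x_i = 1 (our own illustrative instance; the dom argument keeps the line search inside \text{dom}\,f = \{x \succ 0\} ):

    import numpy as np

    def eq_newton(f, grad, hess, A, x, dom, eps=1e-10, alpha=0.25, beta=0.5):
        p, n = A.shape
        while True:
            g, H = grad(x), hess(x)
            KKT = np.block([[H, A.T], [A, np.zeros((p, p))]])
            rhs = np.concatenate([-g, np.zeros(p)])
            dx = np.linalg.solve(KKT, rhs)[:n]
            lam2 = dx @ H @ dx                       # Newton decrement squared
            if lam2 / 2 <= eps:
                return x
            t = 1.0
            while not dom(x + t * dx) or f(x + t * dx) > f(x) + alpha * t * (g @ dx):
                t *= beta
            x = x + t * dx                           # A dx = 0, so x stays feasible

    f = lambda x: x @ np.log(x)
    grad = lambda x: np.log(x) + 1
    hess = lambda x: np.diag(1 / x)
    A = np.ones((1, 3))

    x0 = np.array([0.7, 0.2, 0.1])                   # feasible: sums to 1
    print(eq_newton(f, grad, hess, A, x0, dom=lambda x: np.all(x > 0)))
    # converges to the uniform distribution (1/3, 1/3, 1/3)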














    Interior Point Methods
    Problem Setup We will focus on general convex optimization problems of the form

    \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, \quad i=1,\ldots,m\\  & Ax = b. \end{cases}

    The main assumptions are
    • f_0 is convex.
    • f_0 is twice continuously differentiable: \nabla^2 f_0 is continuous.
    • An optimal solution x^\star \in \text{dom}\,f_0 exists.
    • A \in \mathbb{R}^{p \times n} with \text{rank}\,A = p < n.
    • b \in \mathbb{R}^p is problem data.















    Review on Lagrange Duality The main idea of Lagrange duality is detailed in the following steps.
    We assume strong duality holds: d^\star = p^\star .







    1. Build optimization problem

      \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, \quad i=1,\ldots,m\\  & h_i(x)=0, \quad i=1,\ldots,p \end{cases}









    2. Build Lagrangian

      L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x).

      and Lagrange dual function

      g(\lambda,\nu) = \inf \{L(x,\lambda,\nu) : x \text{ in domain of problem} \}.









    3. Build Lagrange dual problem

      \begin{cases} \text{maximize} & g(\lambda,\nu)\\ \text{subject to} & \lambda \succeq 0. \end{cases}

    4. Recall: if d^\star is the dual optimal value and strong duality holds, then p^\star = d^\star .







    5. Theoretically solve Lagrange dual problem for dual optimal (\lambda^\star,\nu^\star) , noting primal optimal x^\star minimizes

      x \mapsto L(x,\lambda^\star,\nu^\star).

      If

      \begin{cases} \text{minimize} & L(x,\lambda^\star,\nu^\star) \end{cases}

      has unique solution x^\sharp that is primal feasible, then primal optimal is x^\star = x^\sharp .








    6. Implementation of previous step is predicated on dual problem being simpler to solve and L(x,\lambda^\star,\nu^\star) having unique solution.
      In generality, Lagrange duality introduces the KKT optimality conditions

      \begin{aligned} f_i(x^\star) & \leq 0 , \quad i=1,\ldots,m\\ h_i(x^\star) & = 0 , \quad i=1,\ldots,p\\ \lambda_i^\star f_i(x^\star) &= 0, \quad i =1,\ldots,m\\ \lambda^\star &\succeq 0\\ \nabla_x L(x^\star,\lambda^\star,\nu^\star) &=0 \end{aligned}

      For convex problems, these conditions are necessary and sufficient for (x^\star,\lambda^\star,\nu^\star) to be optimal.








    7. In practice: either
      • x^\star is found by directly solving the KKT system, or
      • (\lambda^\star,\nu^\star) is found first and back-substituted, and then x^\star is found.
      N.B.: which route is taken or how (\lambda^\star,\nu^\star) is determined is dictated by problem structure (theoretically or numerically).















    Problem Hierarchy and Outline

    \begin{aligned} \begin{cases} \text{ECQP}:&\\ \text{minimize}& \frac{1}{2}x^TQx + q^T x + r\\ \text{subject to}& Ax=b \end{cases} \boldsymbol{\subset}    \begin{cases} \text{NCOP}:&\\ \text{minimize}& f_0(x)\\ \text{subject to}& Ax=b \end{cases}  \boldsymbol{\subset}    \begin{cases} \text{ICOP}:&\\ \text{minimize}& f_0(x)\\ \text{subject to}& f_i(x)\leq0\\ &Ax=b \end{cases} \end{aligned}

    ECQP:
    Solved via solving KKT system

    \begin{bmatrix} Q & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} x^\star \\ \nu^\star \end{bmatrix} =  \begin{bmatrix} -q\\ b \end{bmatrix}.

    Either solved directly or one appeals to the dual:

    \begin{aligned} \begin{aligned} Qx^\star + A^T \nu^\star  & = -q\\ A x^\star &=  b \end{aligned} &\implies x^\star = -Q^{-1}(q+A^T \nu^\star)\\ &\implies -AQ^{-1}(q+A^T \nu^\star)=b\\ &\implies \nu^\star =-(AQ^{-1}A^T)^{-1}(b+AQ^{-1}q) \end{aligned}

    Solving for \nu^\star allows one to compute x^\star.







    NCOP:
    Solved via sequence of approximating ECQP’s: at each iteration, approximate f_0(x) via

    F(v) = \frac{1}{2}v^T \nabla^2 f_0(x) v + \nabla f_0(x)^T v + f_0(x)

    and solve the ECQP

    \begin{cases} \text{minimize}& \frac{1}{2}v^T \nabla^2 f_0(x) v + \nabla f_0(x)^T v + f_0(x)\\ \text{subject to}& Av=0 \end{cases}

    for the Newton step \Delta x_{\text{nt}} = v^\star .







    This is equivalent to solving the KKT system

    \begin{bmatrix} \nabla^2 f_0(x) & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} \Delta x_{\text{nt}} \\ w \end{bmatrix} =  \begin{bmatrix} -\nabla f_0(x)\\ 0 \end{bmatrix}.

    From above: we may first solve

    w =-(A\nabla^2 f_0(x)^{-1} A^T)^{-1}(A\nabla^2 f_0(x)^{-1} \nabla f_0(x))

    and back substitute to obtain

    \Delta x_{\text{nt}} = -\nabla^2 f_0(x)^{-1}(\nabla f_0(x)+A^T w)

    N.B.: A=0 recovers unconstrained Newton step.







    ICOP:
    The main idea is to create a sequence of NCOP’s whose solutions approximate the solution to the ICOP.
    A first approximation: let

    I_{-}(u) =  \begin{cases} 0 & u \leq 0\\ +\infty & u>0. \end{cases}

    Then the ICOP is equivalent to the equality constrained convex optimization problem

    \begin{cases} \text{minimize}& f_0(x) + \sum_{i=1}^m I_{-}(f_i(x))\\ \text{subject to}& Ax=b \end{cases}

    N.B.: This new objective function need not be smooth and therefore Newton’s method need not apply in general.







    Plan:
    Devise an approximation scheme of this nonsmooth equality constrained convex optimization problem.
    This is achieved via solving a sequence of NCOP’s whose solutions converge to a solution to the ICOP.














    Logarithmic Barriers Logarithmic Approximation of Indicator Function:
    The indicator function

    I_{-}(u) =  \begin{cases} 0 & u \leq 0\\ +\infty & u>0. \end{cases}

    is smoothly approximated by the logarithm

    \hat{I}_{-}(u) =  \begin{cases} -\frac{1}{t}\log(-u) & u < 0\\ +\infty & u \geq 0, \end{cases}

    where t>0 is a parameter dictating the accuracy of approximation.







    The samples t=1,2,4,8,16 (solid curves) and I_- (dashed curve) are plotted below.








    N.B.: recall

    s \to 0  \implies c^{s}\to 1 for c>0 .

    Thus, for large t>0 and fixed u<0 , we have

    -\frac{1}{t}\log(-u) = - \log \left((-u)^{1/t} \right)\approx 0

    and for negative u \approx 0 and fixed t, we have

    -\frac{1}{t}\log(-u) \approx \infty.

    Therefore, -\frac{1}{t}\log(-u) approximates I_-(u) well for large t.







    The Logarithmic Barriers:
    With the equivalence

    \begin{aligned} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0\\  & Ax = b \end{cases} \iff \begin{cases} \text{minimize}& f_0(x) + \sum I_{-}(f_i(x))\\ \text{subject to}& Ax=b \end{cases} \end{aligned}

    and approximation

    I_-(u) \approx \hat{I}_-(u) := -\frac{1}{t}\log(-u)

    we introduce the following approximating logarithmic barrier problems:

    \begin{aligned} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0\\  & Ax = b \end{cases} &\iff \begin{cases} \text{minimize}& f_0(x) + \sum I_{-}(f_i(x))\\ \text{subject to}& Ax=b \end{cases} \\ &\quad\approx \text{(LBP)} \begin{cases} \text{minimize}& f_0(x) -\frac{1}{t} \sum \log(-f_i(x))\\ \text{subject to}& Ax=b \end{cases} \end{aligned}









    Remarks:
      • The approximating term

        \phi(x):=-\sum_{i=1}^m \log(-f_i(x))

        is called a logarithmic barrier for the problem.
      • N.B.: - \log(-u) >0 for 0<-u<1 and so \phi(x) penalizes x closer to boundary of feasible set.
      • The t in f_0(x) + \frac{1}{t}\phi(x) controls this penalty since

        t \approx \infty \implies -\frac{1}{t}\log(-u) \approx 0

        for each fixed u<0.








      • If f_0,f_i are twice continuously differentiable, then so is \phi on its domain

        \text{dom}\,\phi = \{ x : f_i(x) < 0,\, i =1,\ldots,m\}.

      • Moreover, -\log(-u) is convex and increasing in u on u<0 , so f_i convex implies -\log (-f_i(x)) is convex, and hence \phi is also convex.
      • Therefore, the approximating LBP’s are convex and Newton’s method is applicable for each t>0 .
        (“Applicable” means “can be run” and not “will necessarily converge”.)








    1. Thus each LBP is an NCOP for each t>0 .
      Moreover,

      f_0 + \frac{1}{t}\phi \to f_0 + \sum_{i=1}^m I_-(f_i(x))\text{ as } t \to \infty.

      This suggests the approximation scheme: for each t>0 , use Newton’s method to solve the approximating LBP

      \begin{cases} \text{minimize}& f_0(x) + \frac{1}{t} \phi(x) \\ \text{subject to}& Ax=b \end{cases}

      for primal optimal x^\star(t) , and prove x^\star(t) \to x^\star as t \to \infty .








    2. Therefore, informally speaking, we have

      \begin{aligned} \begin{cases} \text{minimize}& f_0(x) + \frac{1}{t}\phi(x)\\ \text{subject to}& Ax=b \end{cases} \to  \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0\\  & Ax = b \end{cases} \end{aligned}

      as t\to \infty .







    3. For sake of notational convenience: there holds the equivalence

      \begin{aligned} \begin{cases} \text{minimize}& f_0(x) + \frac{1}{t} \phi(x) \\ \text{subject to}& Ax=b \end{cases} \iff \begin{cases} \text{minimize}& tf_0(x) + \phi(x) \\ \text{subject to}& Ax=b \end{cases}. \end{aligned}

      In fact, both problems have the same primal optimizers.







    4. To use the KKT conditions for the approximating LBP’s, we record the gradient and Hessian of a general logarithmic barrier \phi(x):

      \begin{aligned} \nabla \phi(x) &= -\nabla \left( \sum_{i=1}^m \log(-f_i (x)) \right) \\ &=- \sum_{i=1}^m \frac{1}{f_i(x)} \nabla f_i(x)\\ \nabla^2 \phi(x) &= \sum_{i=1}^m \frac{1}{f_i(x)^2} \nabla f_i (x) \nabla f_i(x)^T - \sum_{i=1}^m \frac{1}{f_i(x)} \nabla^2 f_i(x). \end{aligned}

      N.B.: \nabla f_i (x) \nabla f_i(x)^T is an n \times n matrix.
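      For linear inequality constraints f_i(x) = g_i^T x - h_i (so \nabla f_i = g_i and \nabla^2 f_i = 0 ), these formulas reduce to sums of rank-one terms; a sketch:

      import numpy as np

      def log_barrier(G, h, x):
          # phi(x) = -sum_i log(h_i - g_i^T x), valid when h - Gx > 0
          s = h - G @ x                        # slacks s_i = -f_i(x)
          phi = -np.sum(np.log(s))
          grad = G.T @ (1 / s)                 # = -sum_i (1/f_i(x)) grad f_i(x)
          hess = G.T @ np.diag(1 / s**2) @ G   # = sum_i (1/f_i(x)^2) g_i g_i^T
          return phi, grad, hess

      # example: barrier for the box 0 <= x <= 1 in R^2
      G = np.vstack([np.eye(2), -np.eye(2)])
      h = np.array([1.0, 1.0, 0.0, 0.0])
      print(log_barrier(G, h, np.array([0.5, 0.25])))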















    Central Path We will always suppose that, for each t>0 , the LBP

    \begin{cases} \text{minimize}& tf_0(x) + \phi(x) \\ \text{subject to}& Ax=b \end{cases}

    has a unique solution x^\star(t) .
    We will also call this problem a centering problem.
    For concreteness, we will assume this problem is uniquely solvable for each t>0 via Newton’s method.







    The x^\star(t) form the central path of

    \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, \quad i = 1,\ldots,m\\  & Ax = b \end{cases}

    which is a path in the feasible set F of this problem.
    Each x^\star(t) is called a central point.







    An example of a central path is depicted below.
    Each point along the curve indicates a solution to an approximating centering problem.
    Intuitively, these solutions should converge to the solution of the original problem.















    Therefore, since the problem is convex: for each t>0 , a point x^\star(t) is central iff the following hold:

    \begin{aligned} \text{strict feasibility}&  \begin{cases} Ax^\star(t) = b\\ f_i(x^\star(t)) < 0 \end{cases}\\ \text{KKT}& \begin{cases} 0&=t \nabla f_0(x^\star(t)) + \nabla \phi(x^\star(t)) + A^T \nu_t\\ &= t \nabla f_0(x^\star(t)) - \sum_{i=1}^m  \frac{1}{f_i(x^\star(t))} \nabla f_i(x^\star(t)) + A^T \nu_t \end{cases}\\ \end{aligned}

    for some \nu_t \in \mathbb{R}^p .
    N.B.: strict feasibility follows since \phi(x) \to \infty as any f_i(x) \to 0^- .














    Convergence and Dual Central Path Objectives:
    1. show x^\star(t) naturally determines a path of dual feasible points (\lambda^\star(t),\nu^\star(t)).
    2. show that x^\star(t) \to x^\star and f_0(x^\star(t)) \to p^\star as t \to \infty.
    Emphasis: the dual central path (\lambda^\star(t),\nu^\star(t)) can serve as a certificate that x^\star(t) is suboptimal within a desired tolerance: f_0(x^\star(t)) - p^\star < \epsilon .
    In short: the objectives follow from Lagrange duality and the KKT conditions for the approximating LBP’s.







    Theorem. Let x^\star(t) be the central path. Then

    f_0(x^\star(t)) - p^\star \leq \frac{m}{t}

    and there exists \nu_t \in \mathbb{R}^p such that

    \lambda^\star_i(t) := -\frac{1}{t f_i(x^\star(t))},\quad \nu^\star(t) := \frac{1}{t}\nu_t

    form a path (\lambda^\star(t),\nu^\star(t)) of dual feasible points.








    Remarks.
    1. Recall: m denotes the number of inequality constraints of the original problem.
    2. The estimate f_0(x^\star(t)) - p^\star \leq \frac{m}{t} implies

      \begin{aligned} f_0(x^\star(t)) &\to p^\star\\ x^\star(t) & \to x^\star. \end{aligned}

      Moreover, it determines exactly which t>0 ensures \epsilon -suboptimality:

      \frac{m}{\epsilon}\leq t \implies f_0(x^\star(t)) - p^\star \leq \epsilon.

    3. Recall: a pair (\lambda,\nu) is called dual feasible if \lambda \succeq 0 and g(\lambda,\nu)>-\infty, where g is the Lagrange dual function of the original problem.








    Proof.
    Outline
    1. Determine KKT conditions for approximating centering problems; this will determine \nu^\star(t).
    2. Compare with Lagrangian for original problem; this will determine \lambda^\star(t).
    3. Show (\lambda^\star(t),\nu^\star(t)) is dual feasible.
    4. Use Lagrange dual function and duality gap to conclude suboptimality estimate.








    1. For each t>0, the Lagrangian for the approximating centering problem

      \begin{cases} \text{minimize}& tf_0(x) + \phi(x) \\ \text{subject to}& Ax=b \end{cases}

      is

      L_t(x,\nu) = tf_0(x) + \phi(x) + \nu^T Ax - \nu^T b.

      Thus, the KKT conditions are

      \begin{cases} Ax^\star(t)=b\\ t \nabla f_0(x^\star(t)) + \nabla \phi(x^\star(t)) + A^T \nu_t = 0, \end{cases}

      for some \nu_t \in \mathbb{R}^p .
      Set

      \nu^\star(t) := \frac{1}{t}\nu_t.

      (Viz.: \nu^\star(t) is a scaled dual optimal Lagrange multiplier for the approximating centering problem at time t.)








    2. The Lagrangian for the original problem

      \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, \quad i = 1,\ldots,m\\  & Ax = b \end{cases}

      is

      \begin{aligned} L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \nu^T Ax - \nu^T b, \end{aligned}

      whose gradient is

      \begin{aligned} \nabla_x L(x,\lambda,\nu) = \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T \nu. \end{aligned}

      We show that L(x,\lambda,\nu) is minimized at x^\star(t) for suitably chosen (\lambda,\nu), which will be given by KKT conditions of the approximating centering problems.















      Compare the gradient of L at x^\star(t) and the KKT conditions for the approximating centering problems:

      \begin{aligned} \nabla_x L(x^\star(t),\lambda,\nu) &= \nabla f_0(x^\star(t)) + \sum_{i=1}^m \lambda_i \nabla f_i(x^\star(t)) + A^T \nu \\ \frac{1}{t}\nabla_x L_t(x^\star(t),\nu_t) & =  \nabla f_0(x^\star(t)) - \sum_{i=1}^m  \frac{1}{tf_i(x^\star(t))} \nabla f_i(x^\star(t)) + A^T \frac{1}{t}\nu_t\\  &=0 \end{aligned}

      Matching coefficients, we see that choosing

      \begin{aligned} \lambda_i &= - \frac{1}{t f_i(x^\star(t))}, \qquad \nu = \frac{1}{t}\nu_t \end{aligned}

      effects

      \nabla_x L(x^\star(t),\lambda,\nu) = 0.









      Indeed, if

      \begin{aligned} \lambda_i^\star(t) &= - \frac{1}{t f_i(x^\star(t))}, \qquad \nu^\star(t) = \frac{1}{t}\nu_t, \end{aligned}

      then

      \begin{aligned} \nabla_x L(x^\star(t),\lambda^\star(t),\nu^\star(t)) &= \nabla f_0(x^\star(t)) + \sum_{i=1}^m \lambda_i^\star(t) \nabla f_i(x^\star(t)) + A^T \nu^\star(t) \\ &=  \nabla f_0(x^\star(t)) - \sum_{i=1}^m  \frac{1}{tf_i(x^\star(t))} \nabla f_i(x^\star(t)) + A^T \frac{1}{t}\nu_t\\ &= \frac{1}{t}\nabla_xL_t(x^\star(t),\nu_t) \\ &=0 \end{aligned}









    3. We now show (\lambda^\star(t),\nu^\star(t)) is dual feasible, i.e., that

      \begin{aligned} \lambda^\star(t) & \succeq 0\\ g(\lambda^\star(t),\nu^\star(t))&> -\infty. \end{aligned}

      First observe:

      \begin{aligned} f_i (x^\star(t)) < 0 & \implies \lambda_i^\star (t) = - \frac{1}{t f_i(x^\star(t))} >0\\ &\implies \lambda^\star(t) \succeq 0. \end{aligned}









    4. Secondly, observe that x\mapsto L(x,\lambda^\star(t),\nu^\star(t)) is convex and so, since x^\star(t) is a critical point of this function, it is a minimizer.
      Consequently, the dual function g(\lambda,\nu) is finite at (\lambda^\star(t),\nu^\star(t)):

      \begin{aligned} g(\lambda^\star(t),\nu^\star(t)) &= \inf\{L(x,\lambda^\star(t),\nu^\star(t)): x \text{ in domain of problem}\}\\ &= f_0(x^\star(t)) + \sum_{i=1}^m \lambda_i^\star(t)f_i(x^\star(t)) + \nu^{\star}(t)^T(Ax^\star(t)-b)\\ &= f_0(x^\star(t)) -  \sum_{i=1}^m \frac{1}{tf_i(x^\star(t))}f_i(x^\star(t)) + \frac{1}{t}\nu_t \cdot 0\\ &= f_0(x^\star(t)) - \frac{m}{t}\\ &>-\infty \end{aligned}

















    5. Lastly, we observe that the evaluation of g(\lambda^\star(t),\nu^\star(t)) paired with weak duality gives:

      \begin{aligned} g(\lambda^\star(t),\nu^\star(t)) = &f_0(x^\star(t)) - \frac{m}{t}\\  &\implies f_0(x^\star(t)) - g(\lambda^\star(t),\nu^\star(t)) = \frac{m}{t}\\ &\implies f_0(x^\star(t)) - \sup g(\lambda,\nu) \leq \frac{m}{t}\\ &\implies f_0(x^\star(t)) - p^\star \leq \frac{m}{t}\\ \end{aligned}
















    The Barrier Method We are now ready to write out the algorithm which uses logarithmic barriers and centering problems to approximate the solution to an inequality constrained convex optimization problem.

    This algorithm is called the barrier method; aka sequential unconstrained minimization technique (SUMT) by Fiacco-McCormick or path-following method.
    
    given 
             initial strictly feasible x 
             initial time t > 0 
             multiplier \mu>1 
             tolerance \epsilon>0 
    repeat: 
    1. Centering step. Find center x^\star(t)  by solving
             \begin{cases} \text{minimize} & tf_0+\phi\\ \text{subject to} & Ax=b \end{cases} 
       with initial point x.
    2. Update. x:= x^\star(t) .
    3. Stopping criterion. quit if \frac{m}{\epsilon}<t .
    4. Increase t. t:=\mu t.
    
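
    To make the method concrete, here is a minimal numpy sketch (not from these notes; all names are illustrative) for the special case

    \begin{cases} \text{minimize} & \frac{1}{2}x^TQx + c^Tx\\ \text{subject to} & x \succeq 0 \end{cases}

    where f_i(x) = -x_i , \phi(x) = -\sum_{i=1}^m \log x_i , and there are no equality constraints:

    import numpy as np

    def center(Q, c, x, t, tol=1e-10, max_iter=50):
        """Inner (Newton) iterations for: minimize t*(0.5 x^T Q x + c^T x) - sum(log(x))."""
        obj = lambda z: t * (0.5 * z @ Q @ z + c @ z) - np.sum(np.log(z))
        for _ in range(max_iter):
            g = t * (Q @ x + c) - 1.0 / x        # gradient of t*f0 + phi
            H = t * Q + np.diag(1.0 / x**2)      # Hessian of t*f0 + phi
            dx = np.linalg.solve(H, -g)          # Newton step
            if -(g @ dx) / 2 < tol:              # half the squared Newton decrement
                break
            s = 1.0
            while np.any(x + s * dx <= 0):       # stay strictly feasible
                s *= 0.5
            while obj(x + s * dx) > obj(x) + 0.25 * s * (g @ dx):  # backtracking
                s *= 0.5
            x = x + s * dx
        return x

    def barrier_method(Q, c, x, t=1.0, mu=10.0, eps=1e-8):
        m = len(x)                     # number of inequality constraints
        while True:
            x = center(Q, c, x, t)     # 1.-2. centering step (outer iteration)
            if m / t < eps:            # 3. stopping criterion: m/eps < t
                return x
            t *= mu                    # 4. increase t

    # Toy usage: minimizing 0.5*||x - x0||^2 over x >= 0 gives max(x0, 0) componentwise.
    x0 = np.array([1.0, -2.0, 3.0])
    print(barrier_method(np.eye(3), -x0, x=np.ones(3)))   # approx. [1, 0, 3]

    Since the sketch returns right after a centering step with m/t < \epsilon, the duality-gap bound above guarantees f_0(x^\star(t)) - p^\star \leq m/t < \epsilon at exit.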








    Remarks.
    1. The iterations of Newton’s method used to solve the centering problem are called inner iterations.
      Each execution of Step 1. is called an outer iteration.







    2. Size of \mu dictates trade-off between number of inner and outer iterations:
      • \mu small: \mu \approx 1 \implies t \approx \mu t \implies x^\star(t) \approx x^\star(\mu t).
        Thus,
        • x^\star(t) is good initial point to compute x^\star(\mu t)
          \implies few Newton steps to move from x^\star(t) to x^\star(\mu t).
          \implies few inner iterations per outer iteration.
        • However: x^\star(t) \approx x^\star(\mu t)
          \implies algorithm moves along central path slowly
          \implies many outer iterations.
        • Newton steps follow along central path quite well.








      • \mu large: \mu \gg 1 \implies \mu t \gg t \implies x^\star(t) and x^\star(\mu t) quite separated.
        Thus,
        • x^\star(t) poor initial point to compute x^\star(\mu t)
          \implies many Newton steps to move from x^\star(t) to x^\star(\mu t)
          \implies many inner iterations.
        • However: large separation of x^\star(t) and x^\star(\mu t)
          \implies algorithm moves along central path quickly
          \implies few outer iterations.
        • Newton steps may diverge far from central path.








    3. Size of the initial t also affects the number of inner and outer iterations:
      • The larger t is:
        • the faster the algorithm moves along central path
          \implies the fewer outer iterations needed
        • the closer x^\star(t) is to x^\star
        • However: if initial x far from x^\star , then may require many initial inner iterations.
        • (These observations apply in the extreme to the aggressive choice t = m/\epsilon, for which the algorithm requires only one outer iteration.)








      • The smaller t is:
        • the slower the algorithm moves along central path
          \implies the more outer iterations needed
        • the farther x^\star(t) is from x^\star
        • Moreover: if initial x far from x^\star(t) and near x^\star , then may require many superfluous initial inner iterations.















    Modified KKT Conditions Recall:
    • the KKT optimality conditions for centering problem at time t are

      \begin{cases} Ax = b\\ t\nabla f_0(x) - \sum_{i=1}^m \frac{1}{f_i(x)} \nabla f_i(x) + A^T \nu = 0. \end{cases}

    • Central path x^\star(t) satisfies strict feasibility

      f_i(x^\star(t)) < 0, \quad i = 1,\ldots,m

      and defines dual central path (\lambda^\star(t),\nu^\star(t)) with

      \lambda_i^\star (t) = - \frac{1}{t f_i(x^\star(t))}, \quad i = 1,\ldots,m.









    Combining the two: x is a central point iff there is a pair (\lambda,\nu) satisfying the modified KKT conditions:

    \begin{cases} Ax = b\\ f_i(x) < 0, \quad i=1,\ldots,m\\ \lambda \succ 0\\ -\lambda_i f_i(x) = \frac{1}{t}, \quad i=1,\ldots,m\\ \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T \nu = 0 \end{cases}

    Main Point.
    Certain interior point methods amount to iteratively solving this system.







    Remarks.
    1. modified KKT conditions are a “continuous” deformation of KKT conditions for original problem.
      Evidently: as t \to \infty , the modified KKT conditions converge to the original KKT conditions.
    2. Complementary slackness

      \lambda_i f_i(x) = 0

      is now “almost” complementary slackness

      -\lambda_i f_i(x) = \frac{1}{t}
















    Newton Step for Centering Problem Recall: the first step in the barrier method is solving the centering problem

    \begin{cases} \text{minimize} & tf_0+\phi\\ \text{subject to} & Ax=b \end{cases}

    where \phi is a log barrier.
    Solving the centering problem using Newton’s method amounts to iteratively solving the KKT systems

    \begin{bmatrix} t\nabla^2 f_0(x) + \nabla^2 \phi(x) & A^T\\ A & 0  \end{bmatrix} \begin{bmatrix} \Delta x_{\text{nt}}\\ \nu_{\text{nt}} \end{bmatrix} = \begin{bmatrix} -t\nabla f_0(x) - \nabla \phi(x)\\ 0 \end{bmatrix}

    Remark.
    Solving this system turns out to be directly related to solving the modified KKT equations, which is generally a nonlinear system.
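
    To make the inner step explicit, here is a schematic numpy fragment (illustrative; the callables grad_f0, hess_f0, grad_phi, hess_phi are assumed supplied, and x is assumed to satisfy Ax=b):

    import numpy as np

    def centering_newton_step(x, t, grad_f0, hess_f0, grad_phi, hess_phi, A):
        """One Newton step for: minimize t*f0 + phi subject to Ax = b."""
        n, p = len(x), A.shape[0]
        g = t * grad_f0(x) + grad_phi(x)
        H = t * hess_f0(x) + hess_phi(x)
        # Block KKT system: [H, A^T; A, 0] [dx_nt; nu_nt] = [-g; 0]
        K = np.block([[H, A.T], [A, np.zeros((p, p))]])
        sol = np.linalg.solve(K, np.concatenate([-g, np.zeros(p)]))
        return sol[:n], sol[n:]   # (dx_nt, nu_nt)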







    To exhibit this relation, first we form

    \begin{cases} Ax = b\\ f_i(x) < 0, \quad i=1,\ldots,m\\ \nabla f_0(x) - \sum_{i=1}^m \frac{1}{t f_i(x)}\nabla f_i(x) + A^T \nu = 0 \end{cases}

    from the modified KKT conditions

    \begin{cases} Ax = b\\ f_i(x) < 0, \quad i=1,\ldots,m\\ \lambda \succ 0\\ -\lambda_i f_i(x) = \frac{1}{t}, \quad i=1,\ldots,m\\ \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T \nu = 0 \end{cases}

    by solving

    -\lambda_i f_i(x) = \frac{1}{t}

    for \lambda_i = -\frac{1}{t f_i(x)} and substituting, which eliminates \lambda.









    Then finding the Newton step to solve the centering problem is equivalent to finding the Newton step to solve the nonlinear system

    \begin{cases} Ax = b\\ f_i(x) < 0, \quad i=1,\ldots,m\\ \nabla f_0(x) - \sum_{i=1}^m \frac{1}{t f_i(x)}\nabla f_i(x) + A^T \nu = 0. \end{cases}

    Recall (Newton’s Method for Nonlinear Equations) One approach to solving a nonlinear system of equations

    \begin{aligned} F&:\mathbb{R}^k \to \mathbb{R}^k\\ F(X) &= (F_1(X),\ldots,F_k(X)) = 0 \end{aligned}

    is through Newton’s method.








    Here, one considers the iteration scheme

    \begin{aligned} X^{(k+1)} &= X^{(k)} + \Delta X_{\text{nt}}\\ \Delta X_{\text{nt}} &= - J_F^{-1}(X^{(k)})F(X^{(k)})\\ J_F(X)&=  \begin{bmatrix} \frac{\partial F_1}{\partial X_1} & \cdots & \frac{\partial F_1}{\partial X_k}\\ \vdots & \ddots & \vdots\\ \frac{\partial F_k}{\partial X_1} & \cdots & \frac{\partial F_k}{\partial X_k} \end{bmatrix} \end{aligned}

    We call \Delta X_{\text{nt}} the Newton step for solving the nonlinear system F(X)=0.
    Under suitable circumstances, if X^\star solves F(X)=0, and X^{(0)} near X^\star , then X^{(k)} \to X^\star .
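
    A minimal numpy sketch of this iteration on a toy system (illustrative; explicit Jacobian supplied):

    import numpy as np

    def newton_system(F, J, X, tol=1e-12, max_iter=50):
        """Newton's method for the nonlinear system F(X) = 0 with Jacobian J."""
        for _ in range(max_iter):
            X = X + np.linalg.solve(J(X), -F(X))   # X + Delta X_nt
            if np.linalg.norm(F(X)) < tol:
                break
        return X

    # Toy system: x^2 + y^2 = 1 and x = y, with solution (1/sqrt(2), 1/sqrt(2)).
    F = lambda X: np.array([X[0]**2 + X[1]**2 - 1, X[0] - X[1]])
    J = lambda X: np.array([[2 * X[0], 2 * X[1]], [1.0, -1.0]])
    print(newton_system(F, J, np.array([1.0, 1.0])))   # approx. [0.7071, 0.7071]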















    Primal-Dual Interior Point Method Recall: Newton’s method for the inner iterations to solve the centering problems is equivalent to using Newton’s method to solve the nonlinear modified KKT conditions

    \begin{cases} Ax = b\\ f_i(x)<0, \quad i =1,\ldots,m\\ \nabla f_0(x) - \sum_{i=1}^m \frac{1}{t f_i(x)}\nabla f_i(x) + A^T \nu = 0 \end{cases}

    after eliminating \lambda via

    -\lambda_i f_i(x) = \frac{1}{t}.









    If we instead use Newton’s method to solve the full modified KKT conditions

    \begin{cases} Ax = b\\ f_i(x) < 0, \quad i=1,\ldots,m\\ \lambda \succ 0\\ -\lambda_i f_i(x) = \frac{1}{t}, \quad i=1,\ldots,m\\ \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T \nu = 0 \end{cases}

    we develop a new algorithm which is an example of a “primal-dual interior point method”.







    Main Features.
    1. There are no inner iterations.
    2. Both primal and dual variables are updated each iteration.
    3. Primal and dual iterates need not be feasible.
    4. Often outperforms the barrier method.








    Primal-dual search direction.
    Define

    \begin{aligned} f(x) =  \begin{bmatrix} f_1(x)\\\vdots\\f_m(x) \end{bmatrix} ,\qquad Df(x)= \begin{bmatrix} \nabla f_1(x)^T\\\vdots\\\nabla f_m(x)^T \end{bmatrix}\\ \text{diag}(\lambda) =  \begin{bmatrix} \lambda_1 & 0 &\cdots & 0 \\ 0 & \lambda_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 &\cdots & \lambda_m \end{bmatrix} ,\qquad  \boldsymbol{1} = \begin{bmatrix} 1\\\vdots\\1\end{bmatrix}\\ r_t(x,\lambda,\nu) = \begin{bmatrix}\nabla f_0(x) + Df(x)^T\lambda + A^T\nu\\ -\text{diag}(\lambda)f(x) - \frac{1}{t}\boldsymbol{1}\\ Ax-b \end{bmatrix} \end{aligned}









    N.B.:

    \begin{cases} r_t(x,\lambda,\nu) =0\\ f_i(x)<0 \end{cases}

    are exactly the modified KKT conditions.
    Thus, if (x,\lambda,\nu) solves this system, then

    (x,\lambda,\nu) = (x^\star(t),\lambda^\star(t),\nu^\star(t)).









    Define the following residuals

    \begin{aligned} \text{dual residual} &= r_{\text{dual}} = \nabla f_0(x) + Df(x)^T \lambda + A^T \nu\\ \text{centrality residual} &= r_{\text{cent}} = -\text{diag}(\lambda)f(x) - \frac{1}{t}\boldsymbol{1}\\ \text{primal residual} &= r_{\text{pri}} = Ax-b. \end{aligned}

    Remarks.
    1. These residuals are just the blocks of r_t :

      r_t = \begin{bmatrix} r_{\text{dual}} \\ r_{\text{cent}} \\ r_{\text{pri}} \end{bmatrix}.

    2. r_{\text{dual}}(x,\lambda,\nu) measures the deviation of (\lambda,\nu) from dual feasibility.
    3. r_{\text{pri}}(x,\lambda,\nu) measures the deviation of x from primal feasibility.
    4. r_{\text{cent}}(x,\lambda,\nu) measures the deviation of (x,\lambda,\nu) from centrality.
    5. All three \approx 0 means (x,\lambda,\nu) nearly solves the modified KKT conditions.
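
    In code, these residuals are a direct transcription; a small numpy sketch (illustrative names; the callables grad_f0, f, Df are assumed supplied):

    import numpy as np

    def residuals(x, lam, nu, t, grad_f0, f, Df, A, b):
        r_dual = grad_f0(x) + Df(x).T @ lam + A.T @ nu
        r_cent = -f(x) * lam - 1.0 / t     # elementwise: -diag(lam) f(x) - (1/t) 1
        r_pri  = A @ x - b
        return r_dual, r_cent, r_pri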








    Using Newton’s Method.
    Applying Newton’s method to solve the nonlinear system

    r_t(x,\lambda,\nu) = 0

    at a point y = (x,\lambda,\nu) satisfying f(x) \prec 0 \prec \lambda results in a Newton step

    \Delta y = (\Delta x, \Delta \lambda, \Delta \nu)

    given by

    \Delta y = -J_{r_t}(y)^{-1}r_t(y),

    where

    J_{r_t}(y)= Dr_t(y) = \begin{bmatrix} \nabla^2 f_0(x) + \sum_{i=1}^m \lambda_i \nabla^2 f_i(x) & Df(x)^T & A^T\\ -\text{diag}(\lambda)Df(x) & -\text{diag}(f(x)) & 0\\ A & 0 & 0 \end{bmatrix}.









    Viz.: the Newton step solves

    \begin{aligned} \begin{bmatrix} \nabla^2 f_0(x) + \sum_{i=1}^m \lambda_i \nabla^2 f_i(x) & Df(x)^T & A^T\\ -\text{diag}(\lambda)Df(x) & -\text{diag}(f(x)) & 0\\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} \Delta x\\ \Delta \lambda \\ \Delta \nu \end{bmatrix} = - \begin{bmatrix} r_{\text{dual}} \\ r_{\text{cent}} \\ r_{\text{pri}} \end{bmatrix} \end{aligned}

    The solution to this system is called the primal-dual search direction and will be denoted by

    \Delta y_{\text{pd}} = (\Delta x_{\text{pd}},\Delta \lambda_{\text{pd}},\Delta \nu_{\text{pd}})
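
    Schematically, in numpy (illustrative; hess_L(x, lam) is assumed to return \nabla^2 f_0(x) + \sum_i \lambda_i \nabla^2 f_i(x)):

    import numpy as np

    def pd_search_direction(x, lam, nu, t, hess_L, grad_f0, f, Df, A, b):
        n, m, p = len(x), len(lam), A.shape[0]
        fx, Dfx = f(x), Df(x)
        r = np.concatenate([grad_f0(x) + Dfx.T @ lam + A.T @ nu,   # r_dual
                            -fx * lam - 1.0 / t,                   # r_cent
                            A @ x - b])                            # r_pri
        K = np.block([
            [hess_L(x, lam),       Dfx.T,            A.T],
            [-np.diag(lam) @ Dfx,  -np.diag(fx),     np.zeros((m, p))],
            [A,                    np.zeros((p, m)), np.zeros((p, p))],
        ])
        dy = np.linalg.solve(K, -r)
        return dy[:n], dy[n:n + m], dy[n + m:]   # (dx_pd, dlam_pd, dnu_pd)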









    Surrogate Duality Gap.
    The iterates (x^{(k)},\lambda^{(k)},\nu^{(k)}) produced by taking steps along the primal-dual search directions \Delta y_{\text{pd}} need not be feasible.
    Therefore, f_0(x^{(k)}) - g(\lambda^{(k)},\nu^{(k)}) need not measure the duality gap.

    In place of duality gap, we use the surrogate duality gap: for (x,\lambda) satisfying f(x) \prec 0 \prec \lambda , the number

    \hat\eta(x,\lambda) = - f(x)^T \lambda.









    N.B.: if x is primal feasible and (\lambda,\nu) is dual feasible with r_{\text{dual}} = 0 , then x minimizes the convex function x \mapsto L(x,\lambda,\nu), so

    \begin{aligned} g(\lambda,\nu) &= L(x,\lambda,\nu)\\ &= f_0(x) + f(x)^T\lambda + \nu^T(Ax-b)\\ &= f_0(x) + f(x)^T\lambda\\ &= f_0(x) - \hat\eta(x,\lambda). \end{aligned}

    (On the central path, where additionally -\lambda_i f_i(x) = \frac{1}{t} , this gives \hat\eta = \frac{1}{t} + \cdots + \frac{1}{t} = \frac{m}{t} .)

    Viz.: r_{\text{pri}} =0, \, r_{\text{dual}} = 0 \implies \hat\eta(x,\lambda) is the duality gap.
    Therefore, r_{\text{pri}},r_{\text{dual}},\hat\eta can be used to define stopping criterion.
    Indeed: \Vert r_{\text{pri}} \Vert_2 and \Vert r_{\text{dual}} \Vert_2 small \implies \hat\eta nearly duality gap.
    Therefore, all three quantities small \implies small duality gap.







    Primal-dual interior point method.
    We may now introduce the algorithm called primal-dual interior point method.
    
    given 
             initial x  satisfying f_1(x)< 0,\ldots, f_m(x) < 0 
             initial \lambda \succ 0  and \nu 
             multiplier \mu>1 
             tolerances \epsilon>0, \epsilon_{\text{feas}}>0 
    repeat: 
    1. Determine t . Set t = \mu m / \hat\eta 
    2. Compute primal-dual search direction \Delta y_{\text{pd}}
    3. Perform line search and update. 
             Determine step length s>0  and set y = y+s\Delta y_{\text{pd}} .
    until: \Vert r_{\text{pri}} \Vert_2 \leq \epsilon_{\text{feas}} , \Vert r_{\text{dual}} \Vert_2 \leq \epsilon_{\text{feas}}  and \hat\eta \leq \epsilon .
    
    N.B.:
    1. the line search is a modified back-tracking line search which ensures f(x) \prec 0 \prec \lambda always holds.
    2. for feasible parameters, the value t= m/\hat\eta corresponds to a duality gap of \hat\eta.
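
    Putting the pieces together, here is a minimal runnable numpy sketch (not from these notes) for the special case minimize \frac{1}{2}x^TQx + c^Tx subject to x \succeq 0 , where f(x) = -x , Df(x) = -I , and there are no equality constraints (so \nu and r_{\text{pri}} drop out):

    import numpy as np

    def pd_ipm(Q, c, x, lam, mu=10.0, eps=1e-8, max_iter=100):
        m = len(x)
        rnorm = lambda xx, ll, t: np.linalg.norm(
            np.concatenate([Q @ xx + c - ll, ll * xx - 1.0 / t]))
        for _ in range(max_iter):
            eta = x @ lam                      # surrogate gap: -f(x)^T lam
            if eta <= eps and np.linalg.norm(Q @ x + c - lam) <= eps:
                break
            t = mu * m / eta                   # 1. determine t
            r_dual = Q @ x + c - lam
            r_cent = lam * x - 1.0 / t         # -diag(lam) f(x) - (1/t) 1
            # 2. primal-dual search direction (f_i linear, so no Hessian terms)
            K = np.block([[Q, -np.eye(m)], [np.diag(lam), np.diag(x)]])
            dy = np.linalg.solve(K, -np.concatenate([r_dual, r_cent]))
            dx, dlam = dy[:m], dy[m:]
            # 3. backtracking keeping x > 0 and lam > 0, then requiring ||r_t|| to decrease
            s = 0.99
            while np.any(x + s * dx <= 0) or np.any(lam + s * dlam <= 0):
                s *= 0.5
            while rnorm(x + s * dx, lam + s * dlam, t) > (1 - 0.01 * s) * rnorm(x, lam, t):
                s *= 0.5
            x, lam = x + s * dx, lam + s * dlam
        return x, lam

    # Same toy problem as before: projection of x0 onto the nonnegative orthant.
    x0 = np.array([1.0, -2.0, 3.0])
    x, lam = pd_ipm(np.eye(3), -x0, x=np.ones(3), lam=np.ones(3))
    print(x)   # approx. [1, 0, 3]

    The choice t = \mu m/\hat\eta in step 1 mirrors N.B. 2 above: for feasible iterates, t = m/\hat\eta reproduces the current surrogate gap, and the extra factor \mu pushes the gap down at each iteration.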















    Appendix
    Differentiating b^Tx Given

    b = \begin{bmatrix}b_1\\b_2\\\vdots\\b_n\end{bmatrix} \in \mathbb{R}^n ,

    define the scalar function

    g(x) = b^Tx = b_1x_1 + \cdots + b_n x_n .

    Using the Taylor expansion

    \begin{aligned} g(y) = g(x) + (y-x)^T \nabla g(x) + \cdots \end{aligned}

    at y = 0, we find

    \begin{aligned} 0 = b^Tx - x^T \nabla g(x), \end{aligned}

    since the higher-order terms vanish for linear g, and so x^T \nabla g(x) = b^Tx for all x; hence

    \nabla g(x) = b.

    To see this computed directly: computing

    \begin{aligned}  \frac{\partial}{\partial x_k} g(x) &= \frac{\partial}{\partial x_k}(b_1x_1 + \cdots + b_nx_n)\\ &=b_k, \end{aligned}

    we conclude

    \nabla g = \begin{bmatrix}\frac{\partial}{\partial x_1} g(x) \\ \frac{\partial}{\partial x_2}g(x) \\ \vdots \\ \frac{\partial}{\partial x_n}g(x) \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = b.
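
    A quick finite-difference check of this identity (illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    b, x = rng.standard_normal(4), rng.standard_normal(4)
    g = lambda z: b @ z                       # g(x) = b^T x
    h = 1e-6
    num_grad = np.array([(g(x + h * e) - g(x - h * e)) / (2 * h) for e in np.eye(4)])
    print(np.allclose(num_grad, b))           # True: grad(b^T x) = b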

    Differentiating \frac{1}{2}x^TQx Given

    Q  = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1n}\\ q_{21} & q_{22} & \cdots & q_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ q_{n1} & q_{n2} & \cdots & q_{nn} \end{bmatrix} \in \boldsymbol{S}^n

    define the scalar function

    f(x) = \frac{1}{2}x^T Q x.

    Using the Taylor expansion

    \begin{aligned} f(y) = f(x) + (y-x)^T \nabla f(x) + \frac{1}{2} (y-x)^T \nabla^2 f(x) (y-x) + \cdots \end{aligned}

    at y = 0 , we find

    \begin{aligned} 0 &= \frac{1}{2}x^TQx - x^T \nabla f(x) + \frac{1}{2} x^T \nabla^2 f(x)x + \cdots\\ &= x^TQx - x^T \nabla f(x) - \frac{1}{2}x^TQx + \frac{1}{2}x^T\nabla^2 f(x) x + \cdots, \end{aligned}

    which is evidently satisfied in case

    \begin{aligned} \nabla f(x) &= Qx\\ \nabla^2 f(x) &= Q. \end{aligned}

    To see this computed directly: expanding the matrix multiplication, we have

    \begin{aligned} f(x) = \frac{1}{2} \sum_{i,j=1}^{n} x_i x_j q_{ij}  \end{aligned}  .

    Computing

    \begin{aligned}  \frac{\partial}{\partial x_k} f(x) &= \frac{1}{2}\sum_{i,j=1}^n \frac{\partial}{\partial x_k} (x_i x_j q_{ij})\\ &=\frac{1}{2} \left(\sum_{i=1}^n x_i q_{ik} + \sum_{j=1}^n x_j q_{kj} \right)\\ &=\frac{1}{2}\sum_{i=1}^n x_i(q_{ik}+q_{ki})\\ &=\sum_{i=1}^n x_iq_{ik} \end{aligned}

    we have

    \begin{aligned} \nabla f(x) &=  \begin{bmatrix} \frac{\partial}{\partial x_1} f(x)\\ \frac{\partial}{\partial x_2} f(x)\\ \vdots\\ \frac{\partial}{\partial x_n} f(x) \end{bmatrix} =Qx \end{aligned} .

    Taking the second derivative gives

    \begin{aligned}  \frac{\partial^2}{\partial x_k \partial x_l} f(x) &= \sum_{i=1}^n\frac{\partial}{\partial x_l}x_i q_{ik}\\ &=q_{lk}. \end{aligned}

    Consequently, there holds

    \begin{aligned} \nabla^2 f(x) &=  \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1 }(x) & \frac{\partial^2 f}{\partial x_2^2 }(x) & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n \partial x_1}(x) & \frac{\partial^2 f}{\partial x_n \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x) \\ \end{bmatrix}\\ & = Q. \end{aligned}

    Example. Let

    \begin{aligned} f(x_1,x_2) &= \frac{1}{2}\begin{bmatrix}x_1 & x_2 \end{bmatrix} \begin{bmatrix} a&b\\c&d \end{bmatrix} \begin{bmatrix}x_1\\x_2 \end{bmatrix}\\ &=\frac{1}{2} \begin{bmatrix}x_1&x_2\end{bmatrix}\begin{bmatrix}ax_1+bx_2\\cx_1+dx_2\end{bmatrix}\\ &=\frac{1}{2}(ax_1^2 + (b+c)x_1x_2 + dx_2^2). \end{aligned}

    Compute

    \begin{aligned} \frac{\partial}{\partial x_1} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_1} (ax_1^2 + (b+c)x_1x_2 + dx_2^2)\\ &=\frac{1}{2}(2ax_1 + (b+c)x_2)\\ \frac{\partial}{\partial x_2} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_2} (ax_1^2 + (b+c)x_1x_2 + dx_2^2)\\ &=\frac{1}{2}((b+c)x_1 + 2dx_2)\\ \end{aligned}.

    Consequently,

    \begin{aligned} \nabla f(x) &= \begin{bmatrix} \frac{\partial}{\partial x_1} f(x) \\ \frac{\partial}{\partial x_2} f(x) \end{bmatrix}\\ &=\begin{bmatrix} \frac{1}{2}(2ax_1 + (b+c)x_2)\\ \frac{1}{2}((b+c)x_1 + 2dx_2) \end{bmatrix}\\ &=\frac{1}{2} \left(\begin{bmatrix}ax_1 + bx_2 \\ cx_1 + dx_2 \end{bmatrix} + \begin{bmatrix} ax_1 + cx_2 \\ bx_1 + dx_2 \end{bmatrix} \right)\\ &=\frac{1}{2}\left( \begin{bmatrix}a&b\\c&d\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix} + \begin{bmatrix}a&c\\b&d\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix}\right)\\ &=\frac{1}{2}(Q+Q^T)x. \end{aligned}

    Next compute the second derivatives:

    \begin{aligned} \frac{\partial^2}{\partial x_1^2} f(x_1,x_2) &=\frac{1}{2}\frac{\partial}{\partial x_1}(2ax_1 + (b+c)x_2)\\ &=a\\ \frac{\partial^2}{\partial x_1 \partial x_2} f(x_1,x_2) &=\frac{1}{2}\frac{\partial}{\partial x_2}(2ax_1 + (b+c)x_2)\\ &=\frac{1}{2}(b+c)\\ \frac{\partial^2}{\partial x_2 \partial x_1} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_1}((b+c)x_1 + 2dx_2)\\ &=\frac{1}{2}(b+c)\\ \frac{\partial^2}{\partial x_2^2} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_2}((b+c)x_1 + 2dx_2)\\ &=d. \end{aligned}

    Putting this together gives

    \begin{aligned} \nabla^2 f(x) &= \begin{bmatrix} \frac{\partial^2}{\partial x_1^2}f & \frac{\partial^2}{\partial x_1 \partial x_2}f\\ \frac{\partial^2}{\partial x_2 \partial x_1}f & \frac{\partial^2}{\partial x_2^2}f \end{bmatrix}\\ &= \frac{1}{2}\begin{bmatrix} a & b+c\\ c+b & d \end{bmatrix}\\ &= \frac{1}{2}\begin{bmatrix} a & b\\ c& d \end{bmatrix} + \frac{1}{2}\begin{bmatrix}a&c\\b&d\end{bmatrix}\\ &= \frac{1}{2}(Q+Q^T). \end{aligned}
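
    A quick finite-difference check (illustrative), using a deliberately nonsymmetric Q to exercise the \frac{1}{2}(Q+Q^T) formula from the example:

    import numpy as np

    rng = np.random.default_rng(1)
    Q = rng.standard_normal((3, 3))           # nonsymmetric on purpose
    x = rng.standard_normal(3)
    f = lambda z: 0.5 * z @ Q @ z             # f(x) = 0.5 x^T Q x
    h = 1e-6
    num_grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])
    print(np.allclose(num_grad, 0.5 * (Q + Q.T) @ x))   # True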

    Differentiating b^TAx Given A \in \mathbb{R}^{m \times n} , b \in \mathbb{R}^m , define the scalar function h(x) = b^TAx.
    Using the Taylor expansion

    h(y) = h(x) + (y-x)^T\nabla h(x) + \cdots

    at y = 0, we get

    \begin{aligned} 0 &= b^TAx - x^T \nabla h(x) + \cdots\\ &= x^T(A^Tb) - x^T \nabla h(x) + \cdots \end{aligned}

    which directly implies

    \nabla h(x) = A^Tb .

    To see this computed directly, first compute

    \begin{aligned} b^TAx &= \begin{bmatrix}b_1& \cdots & b_m \end{bmatrix} \begin{bmatrix} a_1^T\\ \vdots \\ a_m^T \end{bmatrix} \begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}\\ &=\begin{bmatrix}b_1& \cdots & b_m \end{bmatrix} \begin{bmatrix} a_1^Tx\\\vdots\\ a_m^T x\end{bmatrix}\\ &= b_1a_1^Tx + \cdots + b_ma_m^Tx. \end{aligned}

    Computing

    \begin{aligned} \frac{\partial}{\partial x_k} b^TAx &= \frac{\partial}{\partial x_k}(b_1a_1^Tx + \cdots + b_ma_m^Tx)\\ &= b_1 a_{1k} + \cdots + b_m a_{mk}, \end{aligned}

    we conclude

    \begin{aligned} \nabla (b^TAx) &= \begin{bmatrix}  b_1 a_{11} + \cdots + b_m a_{m1}\\ \vdots\\ b_1 a_{1n} + \cdots + b_m a_{mn} \end{bmatrix}\\ &= A^Tb. \end{aligned}
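
    And a quick finite-difference check of this identity (illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    A, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
    x = rng.standard_normal(4)
    hfun = lambda z: b @ (A @ z)              # h(x) = b^T A x
    step = 1e-6
    num_grad = np.array([(hfun(x + step * e) - hfun(x - step * e)) / (2 * step)
                         for e in np.eye(4)])
    print(np.allclose(num_grad, A.T @ b))     # True: grad(b^T A x) = A^T b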