ECSE 507 Lecture Notes


These are notes used for lecturing ECSE 507 during the Winter 2024 semester.
Please note that these notes are constantly evolving; beware of typos/errors.
These notes are heavily influenced by Boyd and Vandenberghe’s text Convex Optimization.
I appreciate being informed about typos/errors in these notes.



Introduction
General Problem We are interested in solving and studying minimization problems subject to constraints:

\begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & x \in D \end{cases},

where
  • x \in \mathbb{R}^n is the vector of problem parameters;
  • f_0:\mathbb{R}^n \to \mathbb{R} is the objective function, and f_0(x) is usually interpreted as the cost of choosing x ;
  • D \subset\mathbb{R}^n is a constraint set often described “geometrically”.
Example 1. Design Problem Interpretation Let
  • x,y=scalar-valued design variables
    (e.g., dimensions of manufactured object, yaw and pitch of jet).
  • f_0(x,y) = penalty for choosing design (x,y)
    (e.g., cost in material, energy, time, deviation from desired path).
  • D = design specifications
    (i.e., allowable/possible values for (x,y)).
    E.g., D = \{ (x,y) : a<x<b, c<y<d\} specifies minimum and maximum design values.
Then the problem is to find optimal design values (x,y) which minimize cost f_0 and satisfy the design specifications (x,y) \in D.


Example 2. Minimize function over ellipse Let
f_0(x,y) = 1 - \frac{\left(x^{2}+y^{2}\right)}{2}
D = \{ (x,y) \in \mathbb{R}^2 : x^2 + 2 y^2 \leq 1 \} .
Then the problem

\begin{cases} \text{minimize} & 1-\frac{\left(x^{2}+y^{2}\right)}{2}\\ \text{subject to} & x^2 + 2 y^2 \leq 1 \end{cases}

is a familiar kind of purely geometric optimization problem.
It has two solutions: (x,y,f_0(x,y)) = (\pm 1, 0 , \frac{1}{2}) .
Many applied problems can look like this, but with an applied interpretation.
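N.B.: claims like this can be sanity-checked numerically. The following sketch (assuming NumPy is available) does a brute-force grid search over the ellipse D of Example 2; it is a check on the analysis, not a substitute for it.

```python
import numpy as np

# Brute-force check of Example 2: minimize 1 - (x^2 + y^2)/2
# over D = {(x, y) : x^2 + 2 y^2 <= 1} on a fine grid.
xs = np.linspace(-1.2, 1.2, 481)
X, Y = np.meshgrid(xs, xs)

f0 = 1 - (X**2 + Y**2) / 2        # objective function
inside = X**2 + 2 * Y**2 <= 1     # membership in the constraint set D

fmin = f0[inside].min()           # minimal value over grid points in D
print(fmin)                       # close to 1/2, attained near (±1, 0)
```

The grid minimum agrees with the solutions (\pm 1, 0, \frac{1}{2}) found above, up to grid resolution.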
Problems With Structure General optimization problems can be numerically inefficient to solve or analytically difficult, unless f_0 and D have additional structure/properties.
Identifying nice structure/properties of problem \implies problem may become analytically solvable or numerically efficient.
Examples of nice structure:
linearity
f_0(x) = c^T x
D defined in terms of linear equalities/inequalities;
e.g., a_1^T x \geq 0, \ldots, a_m^T x \geq 0, or succinctly written A x \succeq 0.


convexity or quasiconvexity
E.g., f_0 is convex if f_0(tx+(1-t)y) \leq tf_0(x) + (1-t)f_0(y) for all x,y and 0<t<1 ;
D is itself a convex set.


sparsity or other matrix structure
E.g., f_0(x) = x^T C x and D given by Ax \succeq 0
If C,A sparse
\implies many terms vanish
\implies improved computational efficiency
\begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 1\\ 0 & 1 & 0 & 0 & 0 & 0\\ 1 & 0 & 0 & 1 & 0 & 1\\ 0 & 1 & 0 & 1 & 0 & 0\\ 0 & 1 & 0 & 0 & 1 & 0 \end{bmatrix} ,\qquad \begin{bmatrix} 1&2&0&0\\ 2&1&0&0\\ 0&0&2&1\\ 0&0&1&2 \end{bmatrix}
Linear Programming Simplest case: f_0 is linear and D defined in terms of linear constraints.
Notation: if x, y \in \mathbb{R}^n , then

x \succeq y means x_i \geq y_i for i =1,\ldots, n .

c^T = transpose of the column vector c \in \mathbb{R}^n.
Linear program: given c \in \mathbb{R}^n, b \in \mathbb{R}^m, A \in \mathbb{R}^{m \times n}, solve

\begin{cases} \text{minimize}&c^T x\\ \text{subject to}& Ax \succeq b\\ &x \succeq 0 \end{cases}.

Thus f_0(x) = c^T x and D = \{x \in \mathbb{R}^n: Ax \succeq b, x \succeq 0\}.
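For concreteness, small linear programs can be handed to an off-the-shelf solver. The sketch below (assuming SciPy is available) solves the hypothetical instance c=(1,2), A=\begin{bmatrix}1&1\end{bmatrix}, b=1, whose optimum is x=(1,0) with value 1; note the sign flip needed because linprog expects “\leq” constraints.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical instance of: minimize c^T x subject to Ax >= b, x >= 0.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])   # one constraint: x1 + x2 >= 1
b = np.array([1.0])

# linprog expects "<=" constraints, so negate: Ax >= b  <=>  -Ax <= -b.
# The default variable bounds are x >= 0, matching the standard form above.
res = linprog(c, A_ub=-A, b_ub=-b)
print(res.x, res.fun)        # optimum at x = (1, 0) with value 1
```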

Positives of linear programming:
  • Conceptually simple: relies heavily on linear algebra
  • There are classical numerical methods which are often very efficient.
  • If x^\star \in D is local minimizer of f_0 on D, then it is automatically a global minimizer on D.
  • Can sometimes approximate smooth problems linearly; however, usually can only give “local” results.
    (E.g., \sin(x) \sim x for 0 \leq x \ll 1 .)


Shortcomings of linear programming:
  • Many applied problems are not linear.
  • Many problems may not even be (suitably) approximated by linear programs.
    E.g., the “barrier”

    I_{-}(x) =  \begin{cases} 0 & x <0\\ \infty & x\geq0 \end{cases}

    is better approximated by a “logarithmic barrier” of the form - c \log(-x) than by any linear function.
Convex Optimization Convex optimization problem: f_0 and D are convex.
This is the main focus of the course.

Positives of Convex Optimization:
  • Relatively conceptually simple.
  • Still often have efficient, albeit more sophisticated, numerical methods.
  • Many applied problems may be recast as or approximated by convex optimization problems.
  • If x^\star \in D is local minimizer of f_0 on D, then it is automatically a global minimizer on D.


Shortcomings of Convex Optimization:
  • Problems may seriously fail to be analytically tractable or numerically efficient.
  • There exist nonlinear problems which cannot be approximated by convex problems.
  • Example (Least Squares) A standard and ubiquitous kind of convex optimization problem is the least squares problem.
    This problem takes the form:

    \begin{cases} \text{minimize} & \Vert Ax-b \Vert^2\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ &h_i(x) =0, i=1,\ldots,p \end{cases},

    where
    • \Vert \cdot \Vert is some norm
    • A \in \mathbb{R}^{m \times n} a matrix
    • b \in \mathbb{R}^m a fixed vector
    • f_0(x) = \Vert A x-b \Vert^2 is the (convex) objective function
    • f_i,h_i are convex.
    N.B.: will come back to this problem.
    Example: Distance from points to ellipse If \Vert x \Vert^2 = x_1^2 + \cdots + x_n^2, A is the identity matrix, and D is an ellipsoid, then the solution is the point in the ellipsoid closest to the point b.
    The image below depicts this situation with

    b = \begin{bmatrix}0\\2\end{bmatrix} and D = \{(x,y): (x-2)^2 + 2(y-2)^2 \leq 1 \} .

    Here, the optimal solution is (x,y) = (1,2) .
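This projection can be checked numerically. Since b lies outside D, the minimizer sits on the boundary of the ellipse, which the sketch below (assuming NumPy is available) parametrizes directly and searches over a fine grid.

```python
import numpy as np

# Project b = (0, 2) onto D = {(x, y) : (x-2)^2 + 2(y-2)^2 <= 1}.
# The boundary is parametrized by (2 + cos t, 2 + sin t / sqrt(2)).
b = np.array([0.0, 2.0])
theta = np.linspace(0, 2 * np.pi, 100001)
boundary = np.stack([2 + np.cos(theta),
                     2 + np.sin(theta) / np.sqrt(2)], axis=1)

dists = np.linalg.norm(boundary - b, axis=1)   # distance from b to each point
x_star = boundary[np.argmin(dists)]            # closest boundary point
print(x_star)                                  # approximately (1, 2)
```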
    Optimal Control Let
    x(t),u(t),a(t),f(t) be functions \mathbb{R} \to \mathbb{R} .
    Assume x(t) evolves by

    \dot x(t) = a(t) x(t) + f(t) u(t).

    Here,
    • \dot x(t) is the time derivative of x(t) .
    • a(t),f(t) are assumed to be given;
    • we think of x(t) as being the state of some system at time t ;
    • we think of u(t) as an input we are allowed to choose in order to dictate the evolution of x(t) from the initial state x(0) ; i.e., u(t) “controls” the system;
    • When u(t)=u(x(t),t) , the system experiences “feedback.”
    • the goal: choose control law u(t) = u(x(t),t) so that x(t) is as “desirable” as possible.
    Example: Optimal Control Problems Optimal control problem: choose the “best” control u(t) which gives the “most” desirable x(t) .
    Typically “best” and “desirable” are determined by size/cost of u(t) and x(t) ; e.g., one may wish to minimize

    \int u(t)^2 + x(t)^2 dt.

    Therefore: problem is to solve (roughly speaking)

    \begin{cases} \text{minimize} & \int \left(u(t)^2 + x(t)^2\right) dt\\ \text{subject to} & \dot x(t) = a(t) x(t) + f(t) u(t) \end{cases}

    This will be another focus of the course and we will see some optimal control problems can be recast as convex optimization problems.
    Rough Outline of Course
    Part 1: Basics of Convexity and Convex Optimization Problems.
    Part 2: Applications of Convex Optimization Problems.
    Part 3: Algorithms for Solving Convex Optimization Problems.
    Part 4: Topics in Optimal Control.
    Convex Geometry
    Convex Sets Convex set: a subset X \subset \mathbb{R}^n satisfying:

    for all x,y \in X and t \in [0,1], there holds tx + (1-t) y \in X.

    I.e., X contains all line segments whose endpoints belong to X.

    Examples: some standard convex sets.

    1. Closed or open polytopes in \mathbb{R}^n.
      E.g., the interior of a tetrahedron.
    2. Euclidean balls, ellipsoids.
    3. Linear subspaces and affine spaces (e.g., lines, planes).
    4. Given a norm \Vert \cdot \Vert on \mathbb{R}^n, the \Vert \cdot \Vert-ball

      \{ x \in \mathbb{R}^n : \Vert x - x_{0} \Vert \leq r \}

      with center x_0 \in \mathbb{R}^n and radius r>0 is a convex set.
      Recall: a norm satisfies
      • \Vert x+ y \Vert \leq \Vert x \Vert + \Vert y \Vert for all vectors x,y ;
      • \Vert c x \Vert = |c| \Vert x \Vert for all vectors x and scalars c ;
      • \Vert x \Vert=0 iff x =0.
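The convexity of norm balls (which follows from the triangle inequality and homogeneity) can be spot-checked empirically. The sketch below (assuming NumPy is available) samples points in the \ell_1-ball and confirms convex combinations stay inside; the sampler and parameters are illustrative choices, not from the notes.

```python
import numpy as np

# Empirical check that a norm ball is convex: sample pairs of points in
# the l1-ball of radius r and confirm every convex combination stays inside.
rng = np.random.default_rng(0)
r, n = 1.0, 3

def sample_in_l1_ball(k):
    # Rejection sampling from the cube [-r, r]^n (fine for small n).
    pts = []
    while len(pts) < k:
        p = rng.uniform(-r, r, size=n)
        if np.abs(p).sum() <= r:
            pts.append(p)
    return np.array(pts)

X, Y = sample_in_l1_ball(200), sample_in_l1_ball(200)
ok = True
for t in np.linspace(0, 1, 11):
    Z = t * X + (1 - t) * Y                       # convex combinations
    ok &= bool((np.abs(Z).sum(axis=1) <= r + 1e-12).all())
print(ok)   # True, as guaranteed by the triangle inequality
```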
    Affine Subsets Affine subset: A subset X \subset \mathbb{R}^n satisfying:

    For all x,y \in X and t \in \mathbb{R}, there holds tx + (1-t) y \in X.

    I.e., X contains all lines which pass through two distinct points in X.
    N.B.: An affine subset is just a translated linear subspace:
    “a linear space that’s forgotten its origin”.
    Example 1. Let X = \{ (x,y,0): x,y \in \mathbb{R} \} be the xy-plane in \mathbb{R}^3.
    Then any translation or rotation of X is an affine subset.
    Example 2. If A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m, then X=\{x \in \mathbb{R}^n: Ax = b\} is affine.
    (X is just a translate of \text{ker}\,A.)
    Example 3. Let

    A =  \begin{bmatrix} 0&0&0\\ 0&0&0\\ 0&0&1 \end{bmatrix}\qquad \text{ and } \qquad b =  \begin{bmatrix} 0\\0\\3 \end{bmatrix} .

    Then the solution set to Ax=b is

    \{(x,y,3): x,y \in\mathbb{R}\},

    which is just \text{ker}\, A translated by 3 in the z-direction:
    Cones Cone: A subset X \subset \mathbb{R}^n satisfying:

    For all x \in X and t \geq 0, there holds tx \in X.

    I.e., X contains all “positive” rays emanating from the origin and passing through any of its points.
    Proposition. X \subset \mathbb{R}^n is a convex cone iff for all x_1,x_2 \in X and \theta_1,\theta_2\geq0, there holds \theta_1 x_1 + \theta_2 x_2 \in X.
    Proof.
    Step 1. (\implies ) Suppose X is a convex cone and let x_1,x_2 \in X and \theta_1,\theta_2\geq0 be arbitrary.
    Want to show: \theta_1 x_1 + \theta_2 x_2 \in X .
    Step 2. Being conic implies x = \frac{\theta_1}{t} x_1 and y=\frac{\theta_2}{1-t}x_2 belong to X for all 0<t<1.
    Step 3. Being convex implies tx+ (1-t)y = \theta_1 x_1 + \theta_2 x_2 \in X, as desired.
    Step 4. (\impliedby ) Suppose X is such that \theta_1 x_1 + \theta_2 x_2 \in X for all x_1,x_2 \in X and \theta_1,\theta_2\geq0.
    Want to show: X is a convex cone.
    Step 5. X being conic follows from taking \theta_1\geq0 arbitrary and \theta_2 = 0.
    Step 6. Convexity follows from taking \theta_1 + \theta_2 = 1 with \theta_1,\theta_2 \geq0 and 0 \leq \theta_1 \leq 1 .
    Indeed: \theta_1 + \theta_2 = 1 and t := \theta_1 \implies t x_1 + (1-t) x_2 \in X for 0 \leq t \leq 1 since 1-t = \theta_2 .

    Examples.

    1. Hyperplanes \{ x \in \mathbb{R}^n : a^{T} x = 0\} with normal a \in \mathbb{R}^n,
      halfspaces \{ x \in \mathbb{R}^n : a^{T}x \leq 0 \},
      nonnegative orthants \{x \in \mathbb{R}^n: x \succeq 0\} are all convex cones.
      (Here, u \succeq v if u_i \geq v_i for i = 1,\ldots,n.)
    2. Given a norm \Vert\cdot\Vert on \mathbb{R}^n, the \Vert\cdot\Vert-norm cone is

      \{ (x,t) \in \mathbb{R}^{n+1} : \Vert x \Vert \leq t \},

      which is a convex cone in \mathbb{R}^{n+1}.
    3. See “positive semidefinite cone” below.
    Polyhedra Polyhedron: Any subset X \subset \mathbb{R}^n of the form

    X = \{ x \in \mathbb{R}^n : a_j^T x \leq b_j,\ j=1,\ldots,m;\ c_i^T x = d_i,\ i=1,\ldots,p \}

    given the vectors a_j,c_i \in \mathbb{R}^n and scalars b_j,d_i \in \mathbb{R}.
    Thus, X is a finite intersection of halfspaces and hyperplanes.
    N.B.: Introducing equality constraints can be used to reduce dimension.
    Example. The polyhedron below is given by the indicated system of inequalities:



    \begin{aligned} y-x& \leq0\\ -x-y&\leq -1\\ x &\leq 3\\ -x &\leq -1 \end{aligned}

    It is an easy exercise to rewrite the inequalities in the notation a_j^T x \leq b_j for suitable a_j,b_j .
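As an illustration of the a_j^T x \leq b_j notation, the sketch below (assuming NumPy is available) encodes the four inequalities as rows of a matrix and tests membership; the test points are illustrative choices.

```python
import numpy as np

# The four inequalities above in the form a_j^T x <= b_j: each row of A
# is one a_j^T, acting on the vector x = (x, y).
A = np.array([[-1.0,  1.0],   #  y - x <= 0
              [-1.0, -1.0],   # -x - y <= -1
              [ 1.0,  0.0],   #  x     <= 3
              [-1.0,  0.0]])  # -x     <= -1
b = np.array([0.0, -1.0, 3.0, -1.0])

def in_polyhedron(p):
    # p belongs to X iff every inequality a_j^T p <= b_j holds.
    return bool((A @ p <= b).all())

print(in_polyhedron(np.array([2.0, 1.0])))   # True
print(in_polyhedron(np.array([0.0, 0.0])))   # False
```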
    Positive Semidefiniteness
    Symmetric matrix: a matrix X \in \mathbb{R}^{n \times n} satisfying X = X^T ; i.e.,

    X = \begin{bmatrix}  x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{nn} \end{bmatrix}  = \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2}\\ \vdots & \vdots & \ddots & \vdots\\ x_{1n} & x_{2n} & \cdots & x_{nn} \end{bmatrix}  = X^T.

    Set of symmetric matrices:

    \boldsymbol{S}^n = \{ X \in \mathbb{R}^{n \times n}: X = X^T \}.


    Positive semidefinite matrix: X \in \boldsymbol{S}^n satisfying z^T X z \geq 0 for all z \in \mathbb{R}^n.
    Equivalently, X only has nonnegative eigenvalues.
    If X \in \boldsymbol{S}^n is positive semidefinite, then write X \succeq 0 .
    If X,Y \in \boldsymbol{S}^n and X-Y\succeq0 , then write X \succeq Y .
    Set of symmetric positive semidefinite matrices:

    \boldsymbol{S}_+^n = \{ X \in \boldsymbol{S}^n: X \succeq 0 \}.

    (N.B.: \succeq is not the same as component-wise inequality, as was the case for vectors.)

    Positive definite matrix: X \in \boldsymbol{S}_+^n satisfying z^T X z = 0 if and only if z=0.
    Equivalently, X only has positive eigenvalues.
    If X \in \boldsymbol{S}^n_+ is positive definite, then write X \succ 0 .
    If X,Y \in \boldsymbol{S}^n_+ and X-Y\succ0 , then write X \succ Y .
    Set of symmetric positive definite matrices:

    \boldsymbol{S}_{++}^n = \{ X \in \boldsymbol{S}^n: X \succ 0 \}.



    Example 1 Let A = \begin{bmatrix} 2 & 0 \\  0 & 4  \end{bmatrix} .
    Since A has positive eigenvalues \{2,4\} , it follows that A \succ 0 .
    To see A \succ 0 explicitly, observe

    \begin{aligned}  z^T A z &= \begin{bmatrix} z_1 & z_2 \end{bmatrix}  \begin{bmatrix} 2 & 0\\ 0 & 4 \end{bmatrix}  \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\\ &= 2 z_1^2 + 4 z_2^2\\ &\geq0 \end{aligned}


    with z^T A z = 0 iff z = \begin{bmatrix}0\\0\end{bmatrix} .
    Example 2. Let B = \begin{bmatrix} 4 & 0 \\ 0 & 0 \end{bmatrix}.
    Since B has nonnegative eigenvalues \{ 4,0 \} , it follows that B \succeq 0 .
    To see B \succeq 0 explicitly, observe

    \begin{aligned} z^TBz &= \begin{bmatrix}z_1 & z_2 \end{bmatrix} \begin{bmatrix} 4 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\\ &=4 z_1^2\\ &\geq 0. \end{aligned}

    Evidently, z^T B z=0 for z = \begin{bmatrix} 0\\z_2 \end{bmatrix} and so B \not\succ 0.
    Example 3. Let C =\begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix} .
    One can conclude C \succ0 by showing that C has positive eigenvalues \{ 5,1 \} .
    To see it directly, compute

    \begin{aligned} z^T C z &= \begin{bmatrix}z_1 & z_2 \end{bmatrix} \begin{bmatrix} 3&2\\2&3 \end{bmatrix} \begin{bmatrix} z_1 \\z_2 \end{bmatrix}\\ &= 3z_1^2 + 4z_1z_2 + 3z_2^2. \end{aligned}

    But the discriminant (with respect to z_1 ) of this quadratic satisfies -20 z_2^2 \leq 0 , from which we conclude the polynomial is positive unless z_1=z_2=0 and hence C \succ 0 .
    Example 4. Let D = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix} .
    To conclude D \not\succeq 0 , either compute its eigenvalues \{3,-1\} or observe that \det D = 1 - 4 = -3 < 0 , whence D cannot have only nonnegative eigenvalues (the determinant is their product).
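Each of Examples 1-4 can be verified with a numerical eigenvalue routine; e.g. (assuming NumPy is available):

```python
import numpy as np

# Eigenvalue check of Examples 1-4. For symmetric matrices, eigvalsh is
# the right tool: it returns real eigenvalues in ascending order.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
B = np.array([[4.0, 0.0], [0.0, 0.0]])
C = np.array([[3.0, 2.0], [2.0, 3.0]])
D = np.array([[1.0, 2.0], [2.0, 1.0]])

for name, M in [("A", A), ("B", B), ("C", C), ("D", D)]:
    ev = np.linalg.eigvalsh(M)
    # PSD iff the smallest eigenvalue is >= 0; PD iff it is > 0.
    print(name, ev, "PSD:", bool(ev[0] >= -1e-12), "PD:", bool(ev[0] > 0))
```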
    Positive Semidefinite Cone Proposition 1. \boldsymbol{S}^n is a \frac{n(n+1)}{2}-dimensional real vector space and \boldsymbol{S}_+^n is a convex cone in \boldsymbol{S}^n.
    Proof.
    Step 1. \boldsymbol{S}^n is a vector space: if X,Y \in \boldsymbol{S}^n and c \in \mathbb{R}, then it is easy to see:

    (X+cY)^T = X^T + cY^T = X + cY.

    and so X+cY \in \boldsymbol{S}^n.


    Step 2. \text{dim}\,\boldsymbol{S}^n = \frac{n(n+1)}{2}: since X \in \boldsymbol{S}^n implies X=X^T, we have the identification

    \begin{aligned} X&= \begin{bmatrix} \boldsymbol{x_{11}} & \boldsymbol{x_{12}} &  \boldsymbol{x_{13}} & \cdots & \boldsymbol{x_{1n}}\\ x_{12} & \boldsymbol{x_{22}} & \boldsymbol{x_{23}} &\cdots & \boldsymbol{x_{2n}}\\ x_{13} & x_{23} & \boldsymbol{x_{33}} &\cdots & \boldsymbol{x_{3n}}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ x_{1n} & x_{2n} & x_{3n} & \cdots & \boldsymbol{x_{nn}}\\ \end{bmatrix}\\ &\qquad\qquad\qquad\qquad\iff\\ \xi&:= \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} & x_{22} & x_{23} & \cdots & x_{2n} & \cdots & x_{nn} \end{bmatrix}^T, \end{aligned}

    where the bolded entries in X indicate the unique contributions to making \xi.
    Counting the number of bold entries shows \xi has \frac{n(n+1)}{2} entries and hence \xi \in \mathbb{R}^{\frac{n(n+1)}{2}}.


    Step 3. \boldsymbol{S}_+^n is a convex cone: For \theta_1,\theta_2\geq0, X,Y \in \boldsymbol{S}_+^n and z \in \mathbb{R}^n there holds

    z^T(\theta_1 X +\theta_2 Y)z = \theta_1 z^T X z + \theta_2 z^T Y z \geq 0,

    and so \theta_1 X + \theta_2 Y \in \boldsymbol{S}_+^n .
    By the proposition in Convex Geometry.Cones, we conclude the desired result.


    Proposition 2. \begin{bmatrix} a&b\\b&c \end{bmatrix} \in \boldsymbol{S}_+^2 iff

    a,c\geq0 and \det \begin{bmatrix} a&b\\b&c \end{bmatrix} = ac - b^2 \geq 0.

    Proof.
    Step 1. Let

    x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \text{and} \quad  X = \begin{bmatrix} a&b\\b&c \end{bmatrix},


    recalling that X \in \boldsymbol{S}_+^2 iff x^T X x \geq 0 for all x \in \mathbb{R}^2.

    Step 2. (Case a=0 ) First compute

    x^T X x = \begin{bmatrix}x_1 & x_2 \end{bmatrix} \begin{bmatrix} 0&b\\b&c \end{bmatrix}\begin{bmatrix} x_1\\x_2\end{bmatrix} = 2bx_1x_2+cx_2^2.

    Observe that

    2bx_1x_2 + cx_2^2 \geq 0 for all x_1,x_2 \in \mathbb{R} iff b=0,c\geq0 .

    Note (in case a=0 ): ac - b^2 = 0 iff b=0 .
    Can thus conclude (in case a=0 ):

    x^TXx \geq 0 for all x \in \mathbb{R}^2 iff c\geq0, \det X =0 .



    Step 3. (Case a \neq 0 ) Completing the square gives

    \begin{aligned}  x^T X x &= \begin{bmatrix}x_1&x_2\end{bmatrix}\begin{bmatrix}a&b\\b&c\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix}\\ &=ax_1^2 + 2bx_1 x_2 + c x_2^2 \\ &= a(x_1 + a^{-1}bx_2)^2 + a^{-1}\det\, X\, x_2^2. \end{aligned}

    But

    a(x_1 + a^{-1}bx_2)^2 + a^{-1}\det\, X\, x_2^2 \geq0 for all x_1,x_2 \in \mathbb{R}

    iff

    a>0 and \det\,X \geq0 .

    (To conclude a>0 , take x_2=0.)
    N.B.: strictly speaking, c\geq0 was not used anywhere; however, X \succeq0 immediately implies c \geq0 , and a>0,ac-b^2 \geq0 also implies c\geq0 .

    Step 4. Putting Steps 2. and 3. together, we conclude:
    \begin{bmatrix} a&b\\b&c \end{bmatrix} \in \boldsymbol{S}_+^2 iff

    a,c\geq0 and \det \begin{bmatrix} a&b\\b&c \end{bmatrix} = ac - b^2 \geq 0.
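Proposition 2 gives a cheap positive-semidefiniteness test for 2 \times 2 matrices. The sketch below (assuming NumPy is available) cross-checks it against an eigenvalue computation on random symmetric matrices; the sampling range is an illustrative choice.

```python
import numpy as np

# Cross-check Proposition 2 against eigenvalues on random symmetric
# 2x2 matrices: a, c >= 0 and ac - b^2 >= 0 iff both eigenvalues >= 0.
rng = np.random.default_rng(1)

def psd_by_criterion(a, b, c):
    # The test from Proposition 2.
    return a >= 0 and c >= 0 and a * c - b * b >= 0

def psd_by_eigs(a, b, c):
    # The definition: all eigenvalues of [[a, b], [b, c]] nonnegative.
    return bool(np.linalg.eigvalsh(np.array([[a, b], [b, c]]))[0] >= 0)

agree = all(
    psd_by_criterion(a, b, c) == psd_by_eigs(a, b, c)
    for a, b, c in rng.uniform(-2, 2, size=(1000, 3))
)
print(agree)   # True
```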



    Image of \boldsymbol{S}_+^2 In light of Proposition 1 and Proposition 2, we can plot

    \boldsymbol{S}_+^2 \subset \boldsymbol{S}^2 \cong \mathbb{R}^3.

    The image below depicts the boundary of

    \boldsymbol{S}_+^2 \cong \{ (x,y,z): xz \geq y^2 ,x\geq0,z\geq0\}.

    Separating Hyperplanes Let A,B \subset \mathbb{R}^n be two sets.
    Separating hyperplane: a hyperplane given by

    P_{a,b} := \{ x : a^T x = b \}

    for some a \in \mathbb{R}^n and b \in \mathbb{R} such that

    \begin{aligned} a^Tx -b \geq 0 & \text{ on } A\\ a^Tx - b \leq 0 & \text{ on } B \end{aligned}.

    P_{a,b} is said to separate A and B .
    Thus, P_{a,b} cuts \mathbb{R}^n into two halfspaces with one containing all of A and the other containing all of B .
    Separating Hyperplane Theorem. If A,B \subset \mathbb{R}^n are two nonempty disjoint convex sets, then there exist a \in \mathbb{R}^n and b \in \mathbb{R} such that P_{a,b} is a separating hyperplane which separates A and B .
    Example 1. Consider the convex sets

    \begin{aligned} A &= \{(x,y) : \left(x-1\right)^{2}+2\left(y-1\right)^{2\ }\leq1 \}\\ B &= \{(x,y) : \left(x+1\right)^{2\ }+\ \left(y+1\right)^{2\ }\leq1\}\\ C &= \{(x,y) : \left(x-2\right)^{2}+2\left(y-1\right)^{2\ }\leq1 \}\\ P:&=P_{(1,1),0} = \{(x,y) : y+x=0 \} \end{aligned}.

    These three sets are indicated in the image below.
    Note that P separates the pairs [B,A] and [B,C] . Moreover, the pair [A,C] cannot be separated since A and C have significant overlap.
    Example 2. Consider the convex sets

    \begin{aligned} A &= \{(x,y) : x^{2\ }+\ y^{2}<1 \}\\ B &= \{(x,y) : \left(x+2\right)^{2\ }+y^{2}<1 \}\\ P:&=P_{(1,0),-1} = \{(x,y) : x=-1 \} \end{aligned}.

    These sets are indicated in the image below.
    First note A \cap B = \emptyset since neither contain their boundaries.
    As such, they have a separating hyperplane which is given by P .

    N.B.: Replacing A,B with their respective closures \overline{A},\overline{B} , the plane P still separates \overline{A},\overline{B} .
    Indeed, x+1 \geq 0 for (x,y) \in A and x+1 \leq 0 for (x,y) \in B .
    Supporting Hyperplanes Let A \subset \mathbb{R}^n be a fixed set and fix a boundary point

    x_0 \in \text{bd}A := \overline{A} \setminus \text{int}A .

    If the plane

    P_{a,a^Tx_0} = \{ x : a^Tx = a^Tx_0 \}

    separates A and the singleton \{x_0\} , then P_{a,a^Tx_0} is called a supporting hyperplane of A at x_0 .
    Equivalently, A lies entirely in a halfspace with boundary given by P_{a,a^Tx_0} .

    (Here: \overline{A} indicates the closure of A and \text{int} A indicates its interior.)

    Example. Consider the convex sets

    \begin{aligned} A &= \{ (x,y) : (x-2)^2 + (y-2)^2 \leq 1 \}\\ P :&= P_{(-1,0),-1} = \{ (x,y) : x = 1 \} \end{aligned}

    with boundary point x_0 = (1,2) \in \partial A .
    Letting a = \begin{bmatrix} -1\\0\end{bmatrix} , we note a^T x_0 + 1 = 0 \geq 0 .
    Next, observe that, if (x,y) \in A , then x \geq 1 and so

    a^Tx + 1 = \begin{bmatrix} -1&0 \end{bmatrix} \begin{bmatrix} x\\y \end{bmatrix} +1 = -x +1 \leq 0.

    Thus P separates A and \{ x_0 \} , showing that P is a supporting hyperplane of A at the boundary point x_0 ; see image below.
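This computation can be spot-checked by sampling points of A ; e.g. (assuming NumPy is available, with rejection sampling from the bounding box as an illustrative choice):

```python
import numpy as np

# Verify the supporting hyperplane numerically: for points of
# A = {(x, y) : (x-2)^2 + (y-2)^2 <= 1}, check a^T x <= a^T x0
# with a = (-1, 0) and x0 = (1, 2), i.e. A lies in the halfspace {x >= 1}.
rng = np.random.default_rng(2)
a, x0 = np.array([-1.0, 0.0]), np.array([1.0, 2.0])

# Sample A by rejection from its bounding box [1, 3] x [1, 3].
pts = rng.uniform(1, 3, size=(5000, 2))
pts = pts[((pts - [2, 2]) ** 2).sum(axis=1) <= 1]

supported = bool((pts @ a <= a @ x0 + 1e-12).all())
print(supported, a @ x0)   # True, with a^T x0 = -1
```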
    Hulls Let X \subset \mathbb{R}^n be a fixed subset.
    Convex hull: the set

    \{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in X,\, \theta_i \geq 0,\, i = 1,\ldots,k,\, \theta_1 + \cdots + \theta_k= 1\}.

    This is just the collection of all convex combinations of points in X and is itself convex.
    Example: The images below depict a set of three points and its convex hull.
    3 points in the plane
    Convex hull of 3 points


    Affine hull: the set

    \{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in X,\, i = 1,\ldots,k,\, \theta_1 + \cdots + \theta_k= 1\}.

    This is just the collection of all affine combinations of points in X and is itself affine.
    Example: The images below depict two points and their affine hull.
    Two points in the plane
      
    Affine hull of two points


    Conic hull: The set

    \{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in X,\, \theta_i \geq 0,\, i = 1,\ldots,k\}.

    This is just the collection of all conic combinations of points in X and is itself a cone.
    Example: The images below depict two points x_1 = (0.5,1) , x_2 = (1.5,0.5) and their conic hull.
    Two points in the plane
      
    Conic hull of two points
    Details To see that the conic hull really is the shaded region, note that, by taking \theta_1 = t s_1 and \theta_2 = (1-t)s_2 , where t \in [0,1] and s_1,s_2\geq0 , the conic hull contains all points of the form ts_1x_1 + (1-t)s_2x_2 .
    Thus, it contains all line segments connecting any two points on the nonnegative rays \{ s x_1 : s \geq 0 \} and \{ s x_2 : s \geq 0 \} .
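Since x_1,x_2 are linearly independent, membership in their conic hull reduces to solving a 2 \times 2 linear system and checking signs; e.g. (assuming NumPy is available, with illustrative test points):

```python
import numpy as np

# Membership test for the conic hull of x1 = (0.5, 1), x2 = (1.5, 0.5):
# since x1, x2 are linearly independent, p = theta1*x1 + theta2*x2 has a
# unique solution, and p is in the conic hull iff theta1, theta2 >= 0.
M = np.array([[0.5, 1.5],
              [1.0, 0.5]])   # columns are x1 and x2

def in_conic_hull(p):
    theta = np.linalg.solve(M, p)          # unique coefficients
    return bool((theta >= -1e-12).all())   # conic iff both nonnegative

print(in_conic_hull(np.array([2.0, 1.5])))   # x1 + x2 -> True
print(in_conic_hull(np.array([-1.0, 1.0])))  # needs theta2 < 0 -> False
```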


    N.B.:
    1. Conic hulls are convex cones.
    2. Taking the “___ hull” of X does indeed result in a “___” set.
    3. The “___ hull” is a construction of the smallest “___” subset containing X.
    Generalized Inequalities Proper cone: a convex cone K \subset \mathbb{R}^n satisfying
    1. K is closed (i.e., K contains its boundary)
    2. K has nonempty interior
    3. x,-x \in K \implies x=0 .
    Generalized inequality: given a proper cone K , a partial ordering \preceq_K on \mathbb{R}^n defined by

    x \preceq_K y \iff y-x \in K .

    N.B.: \preceq_K is only a partial ordering, so not every pair x,y is comparable under \preceq_K . Generalized strict inequality: given a proper cone K , a strict ordering \prec_K on \mathbb{R}^n defined by

    x \prec_K y \iff y-x \in \text{int}\,K .

    Examples
    1. (CO Example 2.14)
      If K = \mathbb{R}^n_+ , then \preceq_K is the standard componentwise vector inequality:

      v \preceq w \iff v_i \leq w_i, \, i = 1,\ldots,n .

      N.B.: \preceq_{\mathbb{R}_+} is the standard inequality on \mathbb{R} .
    2. (CO Example 2.15)
      If K = \boldsymbol{S}_+^n , then

      \begin{aligned} A &\preceq_K B \iff B-A \text{ is positive semidefinite}\\ A &\prec_K B \iff B-A \text{ is positive definite} \end{aligned} .

    3. Let

      K = \{(x_1,x_2) : x_1 \leq 2x_2, x_2 \leq 2x_1 \} .

      Then K is a proper cone.
      In the image below:
      • K is the cone with vertex (0,0) .
      • The cone with vertex x = (-1,1) depicts those y \in \mathbb{R}^2 with x \preceq_K y .
      • The cone with vertex x = (1,-1) depicts those y \in \mathbb{R}^2 with x \preceq_K y .
      N.B.: (0,2)-(-1,1) = (1,1) \in K , and so (-1,1) \preceq_K (0,2) , as indicated in the image.
      Moreover, (-1,1) and (1,-1) are not comparable.
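The comparisons in this example follow directly from the definition x \preceq_K y \iff y - x \in K ; e.g. (assuming NumPy is available):

```python
import numpy as np

# The generalized inequality from the example: x <=_K y iff y - x lies in
# K = {(v1, v2) : v1 <= 2*v2, v2 <= 2*v1}.
def in_K(v):
    return bool(v[0] <= 2 * v[1]) and bool(v[1] <= 2 * v[0])

def leq_K(x, y):
    return in_K(np.asarray(y) - np.asarray(x))

print(leq_K((-1, 1), (0, 2)))    # True: (0,2)-(-1,1) = (1,1) is in K
print(leq_K((-1, 1), (1, -1)))   # False
print(leq_K((1, -1), (-1, 1)))   # False: the pair is not comparable
```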
    Convex Function Theory
    Conventions and Notations
    1. Writing f:\mathbb{R}^n \to \mathbb{R} always means a partial function with domain \text{dom}\,f possibly smaller than \mathbb{R}^n.
      “Function” will mean “partial function.”
    2. If \text{dom}\, f \neq \mathbb{R}^n, we may work with the extension \tilde{f}:\mathbb{R}^n \to \mathbb{R} \cup \{ +\infty \} given by

      \tilde{f}(x) = \begin{cases} f(x) & x \in \text{dom}\, f\\ +\infty & x \notin \text{dom}\, f \end{cases}.

      It is common to implicitly assume f has been extended and to write f for the partial function f and its extension \tilde{f}.
    3. Given a set C \subset \mathbb{R}^n, its indicator function is

      \tilde{I}_C(x) = \begin{cases} 0 & x \in C\\ +\infty & x \notin C \end{cases}.

    4. We write

      \begin{aligned} \mathbb{R}_+ &:= \{ x \in \mathbb{R}: x\geq0 \}\\ \mathbb{R}_{++} &:= \{ x \in \mathbb{R}: x > 0 \} \end{aligned}.

    Convex Functions Let f:\mathbb{R}^n \to \mathbb{R} be a function with convex domain \text{dom}\, f.
    Convexity: for all x,y \in \text{dom}\, f, t \in [0,1] there holds

    f(t x + (1-t)y) \leq t f(x) + (1-t)f(y)

    (This inequality is often called Jensen’s inequality.)



    Strict convexity: for all x,y \in \text{dom}f,\,  x\neq y, t \in (0,1) there holds

    f(t x + (1-t)y) < t f(x) + (1-t)f(y).


    Example: failure of strict convexity In the figure, the solid line indicates part of the graph of x^4 and the dashed line indicates part of the graph of a linear function.
    The linear function is convex but fails to be strictly convex, since linear functions satisfy

    f(tx + (1-t)y) = tf(x) + (1-t)f(y) .





    Concavity and strict concavity: when -f is, respectively, convex and strictly convex.



    Remarks.
    1. It is instructive to compare convexity/concavity with linearity and view the former as weak versions of linearity.

    2. It is common to extend the definition of convexity to extended functions, i.e., those of the form f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\} .

      For example, the indicator function \tilde{I}_{(-\infty,2)} is convex in this sense.
      To give insight, consider the image below, where the thick line is the “graph” of \tilde{I}_{(-\infty,2)} and the dashed line is the “secant line” connecting the points (1,0) to (x,\infty) for any x\geq 2 .

    Examples
    1. All linear functions are convex and concave on their domains.
    2. e^x is convex on \mathbb{R}.
    3. |x|^p is convex on \mathbb{R} for p \geq 1 .
    4. x^p is convex on \mathbb{R}_{++} for p \geq 1 or p \leq 0 and concave for 0 \leq p \leq 1.
    5. - \log\det X is convex on \boldsymbol{S}_{++}^n .
    6. If C \subset \mathbb{R}^n is convex, then its indicator function \tilde{I}_C is convex (in the extended value sense).
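Jensen’s inequality for these examples can be spot-checked at random points; e.g. (assuming NumPy is available, with e^x and |x|^3 as illustrative choices):

```python
import numpy as np

# Spot-check Jensen's inequality f(t x + (1-t) y) <= t f(x) + (1-t) f(y)
# for f = exp and f = |x|^3 at random points and random t in [0, 1].
rng = np.random.default_rng(3)
ok = True
for f in (np.exp, lambda x: np.abs(x) ** 3):
    x, y = rng.uniform(-2, 2, 1000), rng.uniform(-2, 2, 1000)
    t = rng.uniform(0, 1, 1000)
    lhs = f(t * x + (1 - t) * y)
    rhs = t * f(x) + (1 - t) * f(y)
    ok &= bool((lhs <= rhs + 1e-12).all())
print(ok)   # True
```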
    One Dimensional Characterization Proposition. Let f:\mathbb{R}^n \to \mathbb{R} have convex domain and, given

    x \in \text{dom}\, f and v \in \mathbb{R}^n ,

    define the function g:\mathbb{R} \to \mathbb{R} by

    g(t) = f(x+tv)

    with

    \text{dom}\,g := \{ t \in \mathbb{R} : x + tv \in \text{dom}\, f \}.

    Then f is convex iff g is convex for all x \in \text{dom}\,f and v \in \mathbb{R}^n such that g is well-defined.
    Proof.
    Step 1. First note that \text{dom}\,g is convex: it is the intersection of \text{dom}\,f with the line passing through x  with direction v .
    Step 2. (\implies ) Suppose f is convex and let x \in \text{dom}\,f and v \in \mathbb{R}^n be arbitrary.
    Then, for \theta \in [0,1] and t_1,t_2 \in \text{dom}\, g, there holds

    \begin{aligned} g(\theta t_1 + (1-\theta) t_2) &= f(x + (\theta t_1 + (1-\theta)t_2)v)\\ &= f(x + \theta t_1 v + (1-\theta) t_2 v)\\ &= f(\theta(x + t_1 v) + (1-\theta)(x+t_2 v) ) \\ &\leq \theta f(x+t_1 v) + (1-\theta)f(x + t_2 v)\\ &= \theta g(t_1) + (1-\theta)g(t_2), \end{aligned}

    proving that g is convex.
    Step 3. (\impliedby ) Suppose now that each g is convex.
    Fix x,y \in \text{dom}\,f and let \theta \in [0,1] .
    We want to show

    f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) .

    Let v = y-x and

    g(t) = f(x + t(y-x)),

    noting that g(0)=f(x), g(1)=f(y).
    Since g is convex, we conclude

    \begin{aligned} f(\theta x + (1-\theta)y) &= f(x + (1-\theta)(y-x))\\ &= g(1-\theta)\\ &=g(\theta \cdot 0 + (1-\theta)\cdot 1)\\ &\leq \theta g(0) + (1-\theta) g(1)\\ &=\theta f(x) + (1-\theta) f(y). \end{aligned}

    This is enough to conclude f is convex.
    First Order Characterization Proposition. If f:\mathbb{R}^n \to \mathbb{R} is differentiable with convex domain \text{dom}\, f, then f is convex iff

    f(x) + \nabla f(x)^{T}(y-x) \leq f(y), \quad \forall x,y \in \text{dom}\, f.


    Proof (sketch). We prove it in case f:\mathbb{R} \to \mathbb{R} ; the higher dimensional case follows by using that f:\mathbb{R}^n \to \mathbb{R} with convex domain is convex iff it is convex as a single variable function when restricted to lines intersecting \text{dom}\,f .
    Throughout, let x,y \in \text{dom}\,f and t \in [0,1] .
    Step 1. (\implies ) If f is convex, then we obtain the following inequalities

    \begin{aligned} f(ty + (1-t)x) \leq tf(y) + (1-t)f(x)\quad&\text{(convexity)}\\ f(x) + \frac{f(x+t(y-x))-f(x)}{t}  \leq f(y)\quad&\text{(rearranging)}\\ f(x) + \frac{f(x+t(y-x))-f(x)}{t(y-x)}(y-x)  \leq f(y)\quad &\text{(rearranging)}\\ f(x) + f'(x)(y-x)  \leq f(y)\quad&\text{(taking }t \to 0 \text{)}. \end{aligned}

    Step 2. (\impliedby ) Supposing

    f(x) + f'(x)(y-x) \leq f(y)

    we set z = tx + (1-t)y and add the two inequalities

    \begin{aligned} t f(z) + tf'(z)(x - z) &\leq tf(x)\\ (1-t) f(z) + (1-t)f'(z)(y - z) &\leq (1-t)f(y) \end{aligned}

    to obtain

    f(tx+(1-t)y)=f(z) \leq tf(x) + (1-t)f(y).




    Remarks.
    1. For fixed x \in \text{dom}\,f, the mapping

      A_x: \, y \mapsto f(x) + \nabla f(x)^{T}(y-x)

      is affine whose graph is a hyperplane passing through the point (x,f(x)) .
      Therefore, the inequality A_x(y) \leq f(y) means this hyperplane is tangent to the graph of f at (x,f(x)) and lies under the graph of f .
      In fact, this plane is a supporting hyperplane of the epigraph

      \text{epi}\,f := \{ (x,t) \in \mathbb{R}^{n} \times \mathbb{R} : x \in \text{dom}\,f, t \geq f(x)\}

      at the point (x,f(x)) .


    2. The affine mapping A_x is just the first order Taylor approximation of f at x .
      Thus, differentiable convex functions are such that their first order Taylor approximations serve as global underestimators of f .

    Example. In the image below:
    • solid line is the graph of f(x) = e^x ;
    • shaded region is the convex set given by \text{epi}\,f = \{ (x,t): t\geq e^x \} ;
    • dashed line is the supporting hyperplane at (-1,e^{-1}) given by the graph of e^{-1} + e^{-1}(x+1) .


    f(x) + f'(x)(y-x) gives a supporting hyperplane
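The supporting-hyperplane picture for e^x can be verified on a grid; e.g. (assuming NumPy is available):

```python
import numpy as np

# Check the first-order underestimator for f = exp at x = -1: the tangent
# line e^{-1} + e^{-1}(y + 1) should lie below e^y everywhere, touching at y = -1.
y = np.linspace(-5, 5, 2001)
tangent = np.exp(-1) + np.exp(-1) * (y + 1)
gap = np.exp(y) - tangent   # nonnegative iff the tangent underestimates f

print(bool((gap >= -1e-12).all()), float(gap.min()))   # True, minimum gap ~ 0 at y = -1
```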
    Second Order Characterization Proposition. If f is twice-differentiable with \text{dom}\,f convex, then f is convex iff

    \nabla^2 f(x) \succeq 0, \quad \forall x \in \text{dom}\, f.


    Recall: if f: \mathbb{R}^n \to \mathbb{R} is twice-differentiable, then its Hessian is

    \nabla^2 f(x) =  \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1 }(x) & \frac{\partial^2 f}{\partial x_2^2 }(x) & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n \partial x_1}(x) & \frac{\partial^2 f}{\partial x_n \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x) \\ \end{bmatrix} \in \mathbb{R}^{n \times n}.


    Proof (sketch).
    Step 0. The proof is a little more involved, so let us just give two intuitive justifications.
    Justification 1. The second order Taylor approximation gives

    f(y) = f(x) + \nabla f(x)^T (y-x)  + \frac{1}{2} (y-x)^T \nabla^2 f(x)(y-x)

    up to some small error. But

    \nabla^2 f(x) \succeq 0

    implies

    (y-x)^T \nabla^2 f(x)(y-x) \geq0

    and so

    f(y) \geq f(x) + \nabla f(x)^T (y-x)

    (again, up to some small error).
    The first order approximation from Convex Function Theory.First Order Characterization then implies convexity.
    Justification 2. Another intuitive justification is that \nabla^2 f(x) \succeq 0 means the graph of f curves everywhere upward like a paraboloid, which evidently suggests convexity.

    Remarks.
    1. Recall: for x \in \text{dom}\,f , there holds

      \begin{aligned} \nabla^2 f(x) \succeq 0 & \iff \nabla^2 f(x) \text{ is positive semidefinite}\\ & \iff \nabla^2 f(x) \in \boldsymbol{S}_+^n. \end{aligned}

    2. \nabla^2 f (x) \succ 0 for all x \in \text{dom}\, f implies f is strictly convex.
      The converse is false: x^4 is strictly convex, yet its second derivative vanishes at x=0 .
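The second-order test can be illustrated numerically. A minimal Python sketch (the random quadratic and sample points are illustrative choices): for f(x) = \frac{1}{2}x^T Q x with Q = A^T A \succeq 0, the Hessian is Q, and the convexity inequality holds at sampled points.

```python
import numpy as np

# Numerical illustration (not a proof) of the second-order test: for
# f(x) = (1/2) x^T Q x with Q = A^T A (positive semidefinite), the Hessian
# is Q, and the convexity inequality holds at randomly sampled points.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A.T @ A  # PSD by construction; this is the (constant) Hessian of f

f = lambda x: 0.5 * x @ Q @ x
assert np.all(np.linalg.eigvalsh(Q) >= -1e-10)  # Hessian is PSD

for _ in range(100):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    t = rng.uniform()
    assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-9
print("Hessian PSD and convexity inequality verified on samples")
```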

    Level Sets Fix a function f: \mathbb{R}^n \to \mathbb{R} and let c \in \mathbb{R}
    c-Level set: the set

    S(c):=\{ x \in \text{dom}\, f : f(x) = c\}.

    The figure below depicts level sets of x^2 + e^{y^2} with c=5,10,17,26.


    c-Sublevel set: the set

    S_c = \{ x \in \text{dom}\,f : f(x) \leq c\}.

    The figure below depicts the sublevel sets of x^2+e^{y^2} with c=5,10,17,26.
    Each shade of gray indicates a new sublevel set and of course S_c \subset S_{c'} for c<c' .


    c-Superlevel set: the set

    S^c = \{ x \in \text{dom}\,f : f(x) \geq c \}.

    The figure below depicts the superlevel sets of x^2+e^{y^2} with c=5,10,17,26.
    Each shade of gray indicates a new superlevel set and of course S^c \supset S^{c'} for c<c' .


    Proposition. If f is convex, then the sublevel set S_c is convex for all c \in \mathbb{R}.
    Equivalently, if f is concave, then the superlevel set S^c is convex for all c \in \mathbb{R} .


    Proof. Want to show: x,y \in S_c implies tx + (1-t)y \in S_c for all t \in [0,1] .
    If x,y \in S_c , then f(x),f(y) \leq c and so convexity of f gives

    \begin{aligned}  f(tx+(1-t)y) &\leq t f(x) + (1-t)f(y) \\ &\leq t c + (1-t)c \\ &= c  \end{aligned}


    and hence tx + (1-t)y \in S_c as desired.
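The proposition can be checked numerically on the running example f(x,y) = x^2 + e^{y^2}: sample points of a sublevel set and verify that midpoints stay inside (a sketch; the level c = 10 and the sampling box are arbitrary choices).

```python
import numpy as np

# Numerical check that a sublevel set of the convex example above,
# f(x, y) = x^2 + e^(y^2), is convex: midpoints of sampled points of
# S_c remain in S_c. The level c = 10 and grids are arbitrary choices.
f = lambda x, y: x**2 + np.exp(y**2)
c = 10.0

rng = np.random.default_rng(1)
pts = rng.uniform(-3, 3, size=(2000, 2))
inside = pts[f(pts[:, 0], pts[:, 1]) <= c]  # sampled points of S_c

for _ in range(200):
    i, j = rng.integers(len(inside), size=2)
    m = 0.5 * (inside[i] + inside[j])  # midpoint of two points of S_c
    assert f(m[0], m[1]) <= c + 1e-12
print("all sampled midpoints stay in the sublevel set")
```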
    Graphs Fix a function f: \mathbb{R}^n \to \mathbb{R}.
    Graph: the set

    \{ (x,f(x)):x \in \text{dom}\,f \} \subset \mathbb{R}^n \times \mathbb{R}.

    Example: graph of e^x is given below.


    Epigraph: the set

    \text{epi}\, f = \{ (x,t): x \in \text{dom}\, f, f(x) \leq t \} \subset \mathbb{R}^n \times \mathbb{R}.

    Example: epigraph of e^x is given below.


    Hypograph: the set

    \text{hypo}\, f = \{ (x,t) : x \in \text{dom}\, f, f(x) \geq t \} \subset \mathbb{R}^n \times \mathbb{R}.

    Example: hypograph of e^x is given below.


    Proposition. f is convex iff \text{epi}\, f is convex.
    Equivalently, f is concave iff \text{hypo}\, f is convex.
    Proof. (sketch) We consider the case f:\mathbb{R} \to \mathbb{R} for simplicity.
    Step 1. (\implies ) Suppose f is convex and let x,y \in \text{epi}\,f be distinct points.
    If x,y both lie on a vertical line, then clearly tx + (1-t)y \in \text{epi}\,f for t \in [0,1] ; thus, suppose otherwise.
    Let \ell be the line passing through x,y and let x',y' be the two intersection points of \ell with the graph of f .
    (If at most one intersection point exists, then it is easy to see that the segment connecting x and y is in \text{epi}\,f .)
    By convexity of f , the segment formed by tx' + (1-t)y' for t \in [0,1] lies in \text{epi}\,f , which is enough to conclude the segment given by tx + (1-t)y for t \in [0,1] lies in \text{epi}\,f.
    This shows \text{epi}\, f is convex.
    Step 2. (\impliedby ) Suppose now \text{epi}\,f is convex.
    Let x,y be two distinct points on the graph of f .
    Then x,y \in \text{epi}\, f.
    But convexity of \text{epi}\,f implies the segment formed by tx+(1-t)y for t \in [0,1] lies entirely in \text{epi}\,f.
    This is enough to conclude f is convex.
    Convex Calculus The following list details some operations that preserve convexity.
    The main point: to conclude a function f is convex, one often verifies that f may be built from other convex functions using, for example, the operations below.
    N.B.: Conclusions only hold on the common domains of the functions.
    Conical combinations:

    f_1,\ldots,f_m convex and c_1,\ldots,c_m \geq0

    \implies

    c_1 f_1 + \cdots + c_m f_m convex.

    Weighted averages:

    f(x,y) convex in x, w(y) \geq0 \implies \int f(x,y) w(y) dy convex.

    Affine change of variables:

    f:\mathbb{R}^n \to \mathbb{R} convex, A \in \mathbb{R}^{n\times m}, b \in \mathbb{R}^n

    \implies

    f(A x + b) convex.

    Maximum:

    f_1,\ldots,f_m convex

    \implies

    f(x) := \max\{f_1(x),\ldots,f_m(x)\} convex.

    Supremum:

    f(x,a) convex in x for each a \in \mathcal{A}

    \implies

    h(x):=\sup\{f(x,a):a \in \mathcal{A}\} convex.



    Justification. For t \in [0,1] , there holds

    \begin{aligned}  h(tx + (1-t)y) &= \sup \{ f(tx + (1-t)y,a) : a \in \mathcal{A} \}\\ &\leq \sup \{ tf(x,a) + (1-t) f(y,a) : a \in \mathcal{A}\}\\ &\leq t \sup\{ f(x,a) :a \in \mathcal{A}\} + (1-t)\sup\{f(y,a):a \in \mathcal{A} \}\\ &=th(x) + (1-t)h(y). \end{aligned}



    Example. Let

    \begin{aligned} g&:\mathbb{R}^n \to \mathbb{R} \text{ be given}\\ f(x,y)&:= y^T x - g(x). \end{aligned}

    N.B.: for each x , the mapping y \mapsto f(x,y) is affine and hence convex.
    Thus

    h(y):=\sup\{y^Tx - g(x) : x \in \text{dom}\,g \}

    defines a convex function.
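The supremum rule can be illustrated numerically: h(y) = \sup\{yx - g(x)\} is convex even when g is nonconvex. A Python sketch with the arbitrary, illustrative choice g(x) = \sin x on a sampled domain:

```python
import numpy as np

# Numerical illustration of the supremum rule: h(y) = sup_x { y*x - g(x) }
# is convex even when g is nonconvex. Here g(x) = sin(x) on a sampled
# domain [-3, 3] (an arbitrary, illustrative choice).
xs = np.linspace(-3.0, 3.0, 2001)
g = np.sin(xs)
h = lambda y: np.max(y * xs - g)  # sup over the sampled domain

rng = np.random.default_rng(2)
for _ in range(200):
    y1, y2, t = rng.uniform(-2, 2), rng.uniform(-2, 2), rng.uniform()
    assert h(t * y1 + (1 - t) * y2) <= t * h(y1) + (1 - t) * h(y2) + 1e-9
print("h is convex on the sampled points")
```

Here h is a maximum of finitely many affine functions of y, so the convexity inequality in fact holds exactly up to floating-point error.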
    Infimum:

    f:\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R} convex in (x,y)

    C \subset \mathbb{R}^m convex

    \inf_{y \in C}f(x,y) finite for some x

    \implies

    g(x):=\inf_{y \in C}f(x,y) convex on

    \text{dom}\,g := \{ x : (x,y) \in \text{dom}\, f \text{ for some } y \in C\}.

    Fenchel conjugation Let f:\mathbb{R}^n \to \mathbb{R} be given (not necessarily convex).
    Fenchel conjugate: f^*(y) = \sup \{ y^T x - f(x) : x \in \text{dom}\,f\}.

    N.B.:

    \text{dom}\, f^* = \{ y \in \mathbb{R}^n : f^*(y) < \infty \} ;

    i.e., those y \in \mathbb{R}^n for which y^Tx - f(x) is bounded above on \text{dom}\,f as a function of x .

    Intuition. Suppose f: \mathbb{R}_+ \to \mathbb{R}_+ is a differentiable convex function denoting the cost to produce x items.
    For a given unit price y \in \mathbb{R}_+ , the profit of selling x units is

    P(x,y) = yx - f(x) .

    Thus f^*(y) is just the optimal profit for selling at price y .
    N.B.: f convex implies P(\cdot,y) is concave for each y .
    Thus, P(x,y_0) is maximal at x_0 satisfying P'(x_0,y_0) = 0 , i.e., when y_0 = f'(x_0) .
    Viz.: the x_0 where f has slope y_0 .
    The tangent line through (x_0,f(x_0)) is then given by y=y_0 (x - x_0) + f(x_0) .
    Lastly, note that the y-intercept of this line is -y_0x_0 + f(x_0)=-f^*(y_0) .


    Remarks
    1. Often f^* is just called the conjugate function of f .
    2. Since f^* is the supremum of a family of affine functions, f^* is always convex, even if f is not.
      (Follows from Convex Function Theory.Convex Calculus.)
    3. If
      • f is convex
      • \text{epi}\,f is a closed subset of \mathbb{R}^n \times \mathbb{R} ,
      then f^{**} = f .


    Example. We will compute the conjugate function of

    \begin{aligned} f(x) &= e^x\\ \text{dom}\,f &= \mathbb{R}. \end{aligned}


    Thus, let

    \begin{aligned} h_y(x) &= yx - f(x) \\ &= yx - e^x . \end{aligned}

    Case y<0 : h_y(x) is unbounded on \text{dom}\,f = \mathbb{R} since

    h_y(x) \to +\infty as x \to -\infty .

    Thus

    \sup\{yx-e^x: x \in \text{dom}\,f\} = +\infty .



    Case y>0 : Compute

    \begin{aligned} h_y'(x) &= y - e^x =0 \text{ when } x=\log y \\  h_y''(x)&=-e^x \leq 0. \end{aligned}

    Thus x=\log y maximizes h_y and so

    \begin{aligned} \sup\{yx-e^x:x \in \text{dom}\,f\} &= \max\{yx-e^x:x \in \text{dom}\,f \}\\ &= h_y(\log y) \\ &= y\log y - y  \end{aligned}



    Case y=0 : Compute

    h_0(x) = -e^x,

    which evidently has least upper bound 0 and so

    \sup\{ -e^x : x \in \text{dom}\,f \} = 0.



    Conclusion: Since

    yx-e^x

    is bounded above on \text{dom}\,f only for y \geq 0 , it follows that

    \text{dom}\, f^* = \mathbb{R}_{+} .

    Putting everything together:

    f^*(y) = \sup\{yx-e^x\} = y\log y - y for y \in \mathbb{R}_+ ,

    where we take 0 \log 0 = 0 .
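This calculation can be corroborated numerically by approximating the supremum over a fine grid (the grid bounds and tolerances are ad hoc):

```python
import numpy as np

# Numerical corroboration of f*(y) = y log y - y for f(x) = e^x:
# approximate the supremum of y*x - e^x over a fine grid and compare
# with the closed form. Grid bounds and tolerances are ad hoc.
xs = np.linspace(-20.0, 5.0, 200001)
ex = np.exp(xs)

for y in [0.5, 1.0, 2.0, 3.0]:
    approx = np.max(y * xs - ex)   # grid approximation of the sup
    exact = y * np.log(y) - y      # closed form derived above
    assert abs(approx - exact) < 1e-4, (y, approx, exact)

# Case y = 0: the sup is 0, approached as x -> -infinity.
assert abs(np.max(-ex)) < 1e-8
print("grid supremum matches y*log(y) - y")
```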
    Legendre Transform Let f:\mathbb{R}^n \to \mathbb{R} be convex, differentiable and with \text{dom}\,f = \mathbb{R}^n .
    Then, the Fenchel conjugate f^* of f is often called the Legendre transform of f .


    Proposition. If f is as above, z \in \mathbb{R}^n and y = \nabla f(z), then

    f^*(y) = z^T \nabla f(z) - f(z) .

    Proof.
    Step 1. Let

    h_y(x) = y^T x - f(x) .

    and note

    x^\star \in \mathbb{R}^n maximizes h_y iff \nabla h_y(x^\star) = 0

    since h_y is a sum of concave functions and hence concave.
    Step 2. Using Step 1. and

    \begin{aligned}  \nabla(y^T x) &= y\\ \nabla h_y(x) &= \nabla(y^Tx-f(x)) = y - \nabla f(x). \end{aligned}

    conclude

    y = \nabla f(x^\star) iff x^\star maximizes h_y .

    (In particular, z \in \mathbb{R}^n maximizes h_{\nabla f(z)} .)
    Step 3. Letting z, y \in \mathbb{R}^n satisfy

    \begin{aligned} y &= \nabla f(z) \end{aligned}

    and using Steps 1. and 2., we conclude

    \begin{aligned}  f^*(y) &= \sup \{ y^T x - f(x) : x \in \text{dom}\, f\}\\ &= \max \{ h_y(x) : x \in \text{dom}\,f \}\\ & = z^T \nabla f(z) - f(z)  \end{aligned} ,

    as desired.


    Example 1. Let

    f(x) = e^x ,

    and compute

    f'(x) = e^x .

    Given z \in \mathbb{R} , let

    y = f'(z) = e^z ; i.e., z = \log y .

    Thus

    f^*(y) = z f'(z) - f(z) = y \log y - y ,

    which agrees with our calculation for f^* in a previous example.


    Example 2. Fix Q \in \boldsymbol{S}_{++}^n and let

    \begin{aligned} f(x) &= \frac{1}{2} x^T Q x\\ \text{dom}\,f&= \mathbb{R}^n. \end{aligned}

    We will compute

    f^*(y) = \frac{1}{2}y^T Q^{-1}y .

    Step 0. Observe
    1. f is convex:

      \begin{aligned} \nabla^2 f(x) = Q \succ 0. \end{aligned}

      (Justification) Consider case n = 2 .
      Let Q = \begin{bmatrix}a&b\\b&d\end{bmatrix}.
      Thus f(x_1,x_2) = \frac{1}{2}(ax_1^2 + 2 b x_1 x_2 + d x_2^2) .
      Easy now to see \nabla^2 f = Q .
    2. Q \succ 0 implies Q is invertible since then \det Q > 0 .
    Step 1. Using

    \begin{aligned} \nabla f(x) &=\nabla(\frac{1}{2}x^TQx) \\ &= Qx, \end{aligned}

    we conclude

    y = \nabla f(z) \iff y = Qz \iff z = Q^{-1}y.

    Step 2. Let y = \nabla f(z) .
    By preceding proposition and Step 1., there holds

    \begin{aligned}  f^*(y) &= z^T \nabla f(z) - f(z)\\ &= (Q^{-1}y)^T y - \frac{1}{2}(Q^{-1}y)^TQ(Q^{-1}y)\\ &= y^T Q^{-1}y - \frac{1}{2} y^T Q^{-1} y\\ &= \frac{1}{2}y^T Q^{-1} y. \end{aligned}
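A quick numerical check of this identity with a randomly generated positive definite Q (an illustrative sketch, not part of the derivation):

```python
import numpy as np

# Numerical check of f*(y) = (1/2) y^T Q^{-1} y via the proposition:
# with y = grad f(z) = Qz, f*(y) = z^T grad f(z) - f(z). The random Q
# is made positive definite by construction.
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
Q = A.T @ A + 3.0 * np.eye(3)  # positive definite

f = lambda x: 0.5 * x @ Q @ x
for _ in range(20):
    z = rng.standard_normal(3)
    y = Q @ z                                # y = grad f(z)
    lhs = z @ (Q @ z) - f(z)                 # z^T grad f(z) - f(z)
    rhs = 0.5 * y @ np.linalg.solve(Q, y)    # (1/2) y^T Q^{-1} y
    assert abs(lhs - rhs) < 1e-8
print("Legendre transform of the quadratic matches (1/2) y^T Q^{-1} y")
```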

    Other Notions of Convexity There are two other important notions of convexity that we will return to if needed.
    Let f:\mathbb{R}^n \to \mathbb{R} be given.

    Quasiconvexity: \text{dom}\,f and the sublevel sets

    \{x \in \text{dom}\,f: f(x) \leq \alpha \}

    are convex for all \alpha \in \mathbb{R}.
    Features:
    1. Quasiconvex problems may sometimes be suitably approximated by convex problems.
    2. Local minima need not be global minima.


    Log-convexity: f>0 on \text{dom}\, f and \log f is convex; equivalently

    f(tx+(1-t)y) \leq f(x)^tf(y)^{1-t} for all t \in [0,1].

    Generalized Convexity Let K \subset \mathbb{R}^m be a proper cone and let f: \mathbb{R}^n \to \mathbb{R}^m .

    K -convexity: for all x,y \in \mathbb{R}^n and t \in [0,1] , there holds

    f(tx + (1-t)y) \preceq_K t f(x) + (1-t)f(y) .

    Strict K -convexity: for all x \neq y \in \mathbb{R}^n and t \in (0,1) , there holds

    f(tx + (1-t)y) \prec_K t f(x) + (1-t)f(y) .

    Examples
    1. (CO Example 3.47)
      Let K = \mathbb{R}_+^m .
      Then f: \mathbb{R}^n \to \mathbb{R}^m is K-convex iff: \text{dom}\, f is convex and for all x,y \in \text{dom}\,f and t \in[0,1], there holds

      f(tx+(1-t)y) \preceq tf(x) + (1-t)f(y)

      which holds iff

      f_i(tx+(1-t)y) \leq tf_i(x) + (1-t)f_i(y)

      for each i = 1,\ldots, m , i.e., iff f is componentwise convex.
    2. (CO Example 3.48)
      A function f: \mathbb{R}^n \to \boldsymbol{S}^m is \boldsymbol{S}_+^m-convex iff : \text{dom}\,f is convex and for all x,y \in \text{dom}\,f and t \in [0,1] , there holds

      f(tx + (1-t)y) \preceq t f(x) + (1-t)f(y) .

      N.B.:
      • this is a matrix inequality and \boldsymbol{S}_+^m -convexity is often called matrix convexity.
      • f is matrix convex iff z^Tf(x)z is convex for all z \in \mathbb{R}^m .
      • The two functions

        \begin{aligned} \mathbb{R}^{n \times m} \ni X &\mapsto XX^T\\ \boldsymbol{S}_{++}^n \ni X &\mapsto X^p, \quad 1 \leq p \leq 2 \text{ or } -1 \leq p \leq 0. \end{aligned}

        are matrix convex.
    Basics of Optimization Problems
    General Optimization Problems By an optimization problem (OP) we mean the following:

    \text{(OP)} \begin{cases} \text{minimize } & f_0(x) \quad \text{(objective)}\\ \text{subject to }& f_i(x) \leq 0, \quad i=1,\ldots,m \quad \text{(inequality constraints)}\\ & h_i(x) = 0, \quad i=1,\ldots,p \quad \text{(equality constraints)}  \end{cases}.

    We call

    f_0:\mathbb{R}^n \to \mathbb{R} the objective function;
    x \in \mathbb{R}^n the optimization variable or parameters;
    f_i:\mathbb{R}^n \to \mathbb{R}, i=1,\ldots,m, the inequality constraint functions; and
    h_i:\mathbb{R}^n \to \mathbb{R}, i=1,\ldots,p, the equality constraint functions.

    The domain of (OP) is the intersection

    D = \bigcap_{i=0}^{m} \text{dom} \, f_i \cap \bigcap_{i=1}^{p} \text{dom} \, h_i.

    Feasibility Consider an (OP) as above.
    Feasible point: those x \in D satisfying

    \begin{aligned} f_i(x)&\leq 0\quad\text{for } i =1,\ldots,m\\  h_i(x) &= 0 \text{ for }i=1,\ldots,p. \end{aligned}

    Feasible set: the subset F \subset D consisting of the feasible points.
    Feasible problem: A problem with nonempty feasible set, i.e., F \neq \emptyset.
    Infeasible problem: A problem with empty feasible set; i.e., there are no x \in D which satisfy the inequality and equality constraints.

    Remark.
    1. A feasible problem need not have a solution; e.g., f(x) = e^x has no minimizer nor minimum on \mathbb{R} .
    2. An infeasible problem never has a solution: there are no parameters x to even test.
    Basic Example Consider the problem

    \begin{cases} \text{minimize } & \log(1-x^2-y^2)\\  \text{subject to }& (x-1)^2+(y-1)^2-1\leq0\\ & (x-y-1)^2+(y-1)^2-1\leq0 \end{cases}.

    The objective function is

    \begin{aligned}  f_0(x,y) &= \log(1-x^2-y^2)\\ \text{dom}\,f_0 &= \{ (x,y) \in \mathbb{R}^2 : x^2+y^2<1\} \end{aligned} ,

    The inequality constraint functions are

    \begin{aligned} f_1(x,y) &=(x-1)^2 + (y-1)^2-1\\ f_2(x,y) &=(x-y-1)^2 + (y-1)^2 - 1\\ \text{dom}\,f_1 &= \text{dom}\,f_2 = \mathbb{R}^2. \end{aligned} .

    The domain of the problem is

    D = \text{dom}\, f_0 \cap \text{dom}\, f_1 \cap \text{dom}\, f_2 = \{ x^2+y^2<1\}.

    The feasible set: Let

    \begin{aligned} A &= \text{dom}\,f_0\\ B &= \{(x-1)^2+(y-1)^2-1\leq0\}\\ C &= \{(x-y-1)^2 +(y-1)^2-1\leq0\} \end{aligned} .

    These three sets are depicted in the image below.
    Note that the darkest region given by A \cap B \cap C is the feasible set.
    Can we solve the problem? Noting
    1. \log(1-x^2-y^2) \to -\infty as (x,y) approaches a point on the circle \{ x^2+y^2 = 1 \} , and
    2. such sequences exist in the feasible set,
    we conclude the problem does not have a solution.
    The Feasibility Problem Feasibility problem: Given an (OP) with

    \begin{aligned} &\text{inequality constraint functions } f_i, i = 1,\ldots,m\\ &\text{equality constraint functions } h_i, i = 1, \ldots,p \end{aligned}

    solve

    \begin{cases} \text{find} & x\\ \text{subject to} & f_i(x) \leq 0, i = 1, \ldots, m\\ & h_i(x) = 0, i = 1,\ldots, p \end{cases}.

    Viz.: the feasibility problem determines whether the constraints are consistent.

    Example 1. The problem

    \begin{cases} \text{find} & (x,y)\\ \text{subject to} &f_1(x,y) = x^2 + y^2 - 1 \leq 0\\ &f_2(x,y) = (x-1)^2 + y^2 - 1 \leq 0 \end{cases}

    has a solution since the two inequality constraints describe two intersecting disks.
    This is depicted below.


    Example 2. The problem

    \begin{cases} \text{find} & (x,y)\\ \text{subject to} &f_1(x,y) = x^2 + y^2 - 1 \leq 0\\ &f_2(x,y) = (x-1)^2 + y^2 - 1 \leq 0\\ &h_1(x,y) = (x-\frac{1}{2})^2 + y^2 - 1 =0 \end{cases}

    has no solution since the circle given by h_1=0 lies outside of the intersection of the two disks.
    This is depicted below, where the red circle is given by h_1=0 .
    Optimal Value and Solvability Recall:

    \begin{aligned} F &= \text{ feasible set of problem}\\ D &= \text{ domain of problem }. \end{aligned}

    Optimal value: The value

    p^\star = \inf \{ f_0(x) : x \in F \},

    i.e., p^\star is the greatest lower bound: when finite, it is the largest p \in \mathbb{R} such that p\leq f_0(x) for all x \in F .
    N.B.: p^\star \in \mathbb{R} \cup \{ -\infty, + \infty \} .

    Example: Below depicts the graph of f(x) = \frac{1}{x} on \mathbb{R}_{++} .
    Evidently, \inf \{ \frac{1}{x} : x \in \mathbb{R}_{++} \} = 0 .




    Solvable: When the problem satisfies

    there exists x^\star \in F with f_0(x^\star) = p^\star,

    i.e., the minimum value p^\star is attainable.

    Example: Below depicts the graph of a quartic q(x) .
    The problem of minimizing q(x) on \mathbb{R} is solvable with solution given by the minimal point A .
    N.B.: Point B is a local minimum and hence does not give a solution.


    Remarks.
    1. p^\star = \min\{f_0(x):x \in F \} iff the (OP) is solvable.
      Indeed, \min\{f_0(x):x \in F \} is not well-defined unless the (OP) is solvable.
    2. p^\star need not be finite:
      p^\star = -\infty if f_0 is unbounded below on the feasible set; and
      p^\star = + \infty if the OP is infeasible.
    Standard Form Optimization problems need not be placed in the form we defined them.
    We therefore introduce the following definition.

    (OP) in Standard form:

    \text{(OP)} \begin{cases} \text{minimize } & f_0(x) \\ \text{subject to }& f_i(x) \leq 0, \quad i=1,\ldots,m \\ & h_i(x) = 0, \quad i=1,\ldots,p  \end{cases}.

    (This is how we defined (OP) before.)

    Example: Rewriting in standard form. We can recast more general optimization problems in standard form; e.g., consider

    \text{(OP2)} \begin{cases} \text{maximize } & F_0(x) \\ \text{subject to }& F_i(x) \leq G_i(x), \quad i=1,\ldots,m \\ & H_i(x) = K_i(x), \quad i=1,\ldots,p \end{cases}.

    Indeed, taking
    f_0 = -F_0 (noting \min f_0 = -\max F_0)
    f_i = F_i - G_i for i=1,\ldots,m
    h_i = H_i - K_i for i=1,\ldots,p
    we readily recast (OP2) into the standard form (OP).
    Equivalent Problems Suppose we are given two OP’s: (OP1) and (OP2).
    We say (OP1) and (OP2) are equivalent if: solving (OP1) allows one to solve (OP2), and vice versa.
    N.B.: Two problems being equivalent does not mean the problems are the same nor that they have the same solutions.

    Example. Consider the two problems:

    \begin{cases} \text{minimize} & f(x) = x^2\\ \text{subject to}& x \in [1,2] \end{cases}  \quad \text{and} \quad  \begin{cases} \text{minimize}&g(x) = (x+1)^2+1 \\ \text{subject to}&x \in [0,1] \end{cases}.

    Observing

    x^\star \in [1,2] minimizes f(x) on [1,2]

    iff

    x^\star - 1 \in [0,1] minimizes g(x) on [0,1] ,

    we readily see the two problems are equivalent.
    Indeed, if we find the solution x^\star = 1 to the first problem, we readily obtain the solution x^\star - 1 = 0 to the second problem, and vice versa.
    Change of Variables Suppose \phi:\mathbb{R}^n \to \mathbb{R}^n is an injective function with D \subset \phi(\text{dom} \,\phi).
    Then, under the change of variable x \mapsto \phi(x) , we have

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}

    is equivalent to

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(\phi(x))\\ \text{subject to}& f_i(\phi(x)) \leq 0, i=1,\ldots,m\\ & h_i(\phi(x)) = 0, i=1,\ldots,p \end{cases}.

    N.B.: such a change of variables does not change the optimal value p^\star .
    Moreover, injectivity may be dropped.

    Justification. Indeed,
    • if x solves (OP1), then \phi^{-1}(x) solves (OP2).
      (More generally, z such that \phi(z) = x solves (OP2).)
    • if z solves (OP2), then \phi(z) solves (OP1).
    Example Consider the problem

    \begin{cases} \text{minimize} & e^x\\ \text{subject to} & \sqrt{x} - y \leq 0\\ &y-5 \leq 0\\ &x-5\leq 0 \end{cases} .

    In the image below, the shaded region is the feasible set and the curve is the graph of f_0(x) = e^x .
    Consider the change of variables

    \phi(x) = x^2 .

    The objective and constraints change as follows:

    \begin{aligned} f_0(x) = e^x & \to f_0(\phi(x)) = e^{x^2}\\ \sqrt{x}-y \leq 0 & \to |x| - y \leq 0\\ y-5 \leq 0 & \to y - 5 \leq 0\\ x - 5 \leq 0 & \to x^2 - 5 \leq 0 \end{aligned}

    The new feasible region and objective function are plotted below.
    Evidently, this change of variable changed a nonconvex (OP) into a convex one.
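A crude grid search (not a solver; the grid ranges and resolutions are arbitrary) suggests both problems attain the same optimal value p^\star = e^0 = 1, as the equivalence predicts:

```python
import numpy as np

# Crude grid search (not a solver) suggesting both problems attain the
# same optimal value p* = e^0 = 1, as the change-of-variables equivalence
# predicts. Grid ranges/resolutions are arbitrary.
xs = np.linspace(-5.0, 5.0, 1001)
ys = np.linspace(-6.0, 6.0, 241)

# Original problem: minimize e^x s.t. sqrt(x) <= y, y <= 5, x <= 5.
p1 = np.inf
for x in xs[xs >= 0]:  # sqrt requires x >= 0
    if np.any((np.sqrt(x) <= ys) & (ys <= 5) & (x <= 5)):
        p1 = min(p1, np.exp(x))

# Transformed problem: minimize e^(x^2) s.t. |x| <= y, y <= 5, x^2 <= 5.
p2 = np.inf
for x in xs:
    if np.any((np.abs(x) <= ys) & (ys <= 5) & (x**2 <= 5)):
        p2 = min(p2, np.exp(x**2))

assert abs(p1 - 1.0) < 1e-9 and abs(p2 - 1.0) < 1e-9
print("both problems have optimal value 1 on the grid")
```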
    Eliminating Linear Constraints Let A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m and x_0 \in \mathbb{R}^n a solution to Ax=b.
    Let B \in \mathbb{R}^{n\times k} be such that \text{range}\, B = \text{kernel}\, A.
    Then Ax=b iff x=By + x_0 for some y \in \mathbb{R}^{k}.

    Consequently

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & Ax = b \end{cases}

    is equivalent to

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(By+x_0)\\ \text{subject to}& f_i(By+x_0) \leq 0, i=1,\ldots,m\\ \end{cases}.

    N.B.: this can reduce dimension of problem by \text{rank}\, A many variables. (Recall: n = \text{rank}\,A + \text{null}\,A.)

    Justification. Indeed,
    • if x solves (OP1), then any y \in \mathbb{R}^k with x = By+x_0 solves (OP2), and
    • if y solves (OP2), then x = By + x_0 solves (OP1)


    Example. Consider the minimization problem

    \begin{cases} \text{minimize} & x^2 + y^2 \\ \text{subject to} &x\geq0\\ & y-x=1 \end{cases}.

    We may eliminate the variable y by simply using y=x+1 .

    But, to match with above: let

    \begin{aligned} f_0(x,y)=x^2 +y^2, &\quad A = \begin{bmatrix}-1&1\end{bmatrix}, \quad b = 1 \\ x_0 = \begin{bmatrix}0\\1\end{bmatrix},&\quad B = \begin{bmatrix} 1\\1\end{bmatrix} \end{aligned} .

    Thus

    A \begin{bmatrix}x\\y\end{bmatrix}=b \iff \begin{bmatrix}x\\y \end{bmatrix} = Bt + x_0 = \begin{bmatrix}t\\t+1\end{bmatrix}

    for some t \in \mathbb{R} , and so

    f_0(Bt + x_0) = f_0(t,t+1) = t^2 + (1+t)^2 .



    Therefore, the minimization problem becomes

    \begin{cases} \text{minimize} & t^2 + (t+1)^2 \\ \text{subject to} &t\geq0\\ \end{cases},

    which has the obvious solution t^\star = 0 with optimal value p^\star = 1 .
    Thus, the original problem has solution

    \begin{bmatrix}x\\y\end{bmatrix} = \begin{bmatrix}t\\t+1\end{bmatrix} = \begin{bmatrix}0\\1\end{bmatrix} .
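The elimination above can be reproduced numerically, using the matrices from the example and a grid over t (a crude check, not a general-purpose routine):

```python
import numpy as np

# The example above, carried out numerically: eliminate y via y - x = 1
# and minimize t^2 + (t+1)^2 over t >= 0 on a grid (a crude check, not
# a general solver).
A = np.array([[-1.0, 1.0]])
b = np.array([1.0])
x0 = np.array([0.0, 1.0])       # particular solution: A @ x0 = b
B = np.array([[1.0], [1.0]])    # range(B) = kernel(A)
assert np.allclose(A @ x0, b) and np.allclose(A @ B, 0)

f0 = lambda v: v[0]**2 + v[1]**2
ts = np.linspace(0.0, 5.0, 5001)               # feasible t >= 0
vals = np.array([f0(B[:, 0] * t + x0) for t in ts])
t_star = ts[int(np.argmin(vals))]
assert abs(t_star) < 1e-9 and abs(vals.min() - 1.0) < 1e-9
print("t* = 0, p* = 1, recovering (x, y) = (0, 1)")
```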

    Slack Variables Given f:\mathbb{R}^n \to \mathbb{R} with f(x) \leq 0 , then there is a variable s \geq 0 such that f(x) + s = 0 ; such a variable s is called a slack variable.

    Using slack variables s_i,i=1\ldots,m , the problem

    \text{(OP1)}\begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i = 1,\ldots,p \end{cases}.

    is equivalent to the problem

    \text{(OP2)}\begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & s_i \geq 0, i = 1,\ldots,m\\ & f_i(x)+s_i = 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}.



    Remarks.
    1. The x which satisfy the constraints of (OP2) (for some choice of s ) are exactly those which satisfy the constraints of (OP1); this justifies the equivalence.
    2. Let F_1 be the feasible set of (OP1) and F_2 that of (OP2).
      Then F_1 \subset \mathbb{R}^n and F_2 \subset \mathbb{R}^{n+m} ; i.e., the feasible sets are not the same object.
    3. Example: in the images below, the disk depicts a feasible set F_1 = \{ x^2+y^2-1 \leq 0 \} \subset \mathbb{R}^2 and the paraboloid-type set depicts the feasible set F_2 = \{x^2+y^2-1+s =0, s \geq 0 \} with slack variable s .
      N.B.: the permissible (x,y) coordinates are the same for both sets.


    Main point: Solving the system of equations

    \begin{aligned} f_i(x)+s_i &= 0, i=1,\ldots,m\\  h_i(x) &= 0, i=1,\ldots,p \end{aligned}

    and considering only those solutions with s_i \geq 0 may be easier than solving the system of inequalities

    \begin{aligned} f_i(x)&\leq0, i=1,\ldots,m\\  h_i(x) &= 0, i=1,\ldots,p \end{aligned}.



    Example. Consider

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x,y)\\ \text{subject to} & a_1 x + b_1 y - c_1 \leq 0\\ & a_2 x + b_2 y - c_2 = 0 \end{cases}.

    Introduce slack variable s\geq0 satisfying

    a_1 x + b_1 y - c_1 +s = 0.

    Then (OP1) is equivalent to the problem

    \text{(OP2)} \begin{cases} \text{minimize} & f_0(x,y)\\ \text{subject to} & s\geq0\\ & a_1 x + b_1 y - c_1 + s = 0\\ & a_2 x + b_2 y - c_2 = 0 \end{cases}.

    Thus, finding feasible (x,y,s) is just a matter of solving a system of equations and choosing those (x,y,s) with s\geq0 .

    Moreover, one can solve the problem

    \text{(OP3)} \begin{cases} \text{minimize} & f_0(x,y)\\ \text{subject to} & a_1 x + b_1 y - c_1 + s = 0\\ & a_2 x + b_2 y - c_2 = 0 \end{cases}.

    and just choose solutions with s \geq 0 to obtain solutions to (OP2), and hence (OP1).
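A small numerical illustration of this idea, with made-up coefficients a_1 = b_1 = 1, c_1 = 2 and equality constraint x - y = 0: each choice of s \geq 0 turns the constraints into a square linear system whose solution is feasible for the original problem.

```python
import numpy as np

# Slack-variable illustration with made-up coefficients a1 = b1 = 1,
# c1 = 2 and equality constraint x - y = 0: for each s >= 0, the
# constraints become the square linear system
#   x + y - 2 + s = 0,   x - y = 0,
# and every solution is feasible for the original inequality problem.
M = np.array([[1.0, 1.0], [1.0, -1.0]])
for s in np.linspace(0.0, 4.0, 9):
    x, y = np.linalg.solve(M, np.array([2.0 - s, 0.0]))
    assert x + y - 2 <= 1e-12      # original inequality constraint holds
    assert abs(x - y) < 1e-12      # equality constraint holds
print("each s >= 0 yields a feasible point of the original problem")
```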
    Epigraph Form Recall: \text{epi}\, f = \{(x,t): x \in \text{dom}\,f, t \geq f(x) \}.

    The optimization problem

    \text{(OP1)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to}& f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}

    is equivalent to its epigraph form

    \text{(OP2)} \begin{cases} \text{minimize} & t\\ \text{subject to} &f_0(x) - t \leq0\\ & f_i(x) \leq 0, i=1,\ldots,m\\ & h_i(x) = 0, i=1,\ldots,p \end{cases}

    Viz., minimizing f_0 subject to constraints is equivalent to finding the smallest t such that (x,t) \in \text{epi}\, f_0 for some feasible x .

    Proof by picture. The dark curve and shaded region below indicate the epigraph of a function f .
    The red dot indicates the minimum point (x^\star,p^\star) .
    The black dots indicate points (x^\star,t) \in \text{epi}\,f for different values of t .
    Evidently, the smallest t^\star for which (x^\star,t^\star) \in \text{epi}\, f is given by t^\star = p^\star .
    Fragmenting a Problem Proposition. Given f: \mathbb{R}^n \to \mathbb{R} and sets F,F_1,\ldots,F_q with F = F_1 \cup \cdots \cup F_q, let

    \begin{aligned} p^\star &= \inf \{f(x):x \in F\}\\ p_i^\star &= \inf\{ f(x): x \in F_i\}, \quad i = 1,\ldots, q. \end{aligned}

    Then

    p^\star = \min\{p_i^\star: i = 1,\ldots,q\}.

    (Assuming \min\{-\infty,a\} = -\infty for any real number a .)


    Viz., to minimize a function on a set F, one may instead minimize f over pieces of F and then take the minimum optimal value of this procedure.


    Example. Consider the (OP)

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & x \in F \end{cases}

    where F \subset \text{dom}\, f and where the feasible set F is depicted below.
    Consider breaking up F into three regions F_1,F_2,F_3 as indicated below.
    Now formulate the (OP)’s

    \text{(OPi)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & x \in F_i \end{cases}

    for i = 1,2,3 , and let p_i^\star be the optimal value for (OPi).
    Using the preceding proposition, the optimal value of (OP) is given by

    p^\star = \min \{ p_1^\star, p_2^\star, p_3^\star \} .

    Conclusion: solving (OP), whose feasible set is not convex, may be achieved by solving three subproblems (OP1),(OP2),(OP3) whose feasible sets are convex.
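The fragmenting proposition in action, as a one-dimensional Python sketch (the function and the pieces are made up for illustration):

```python
import numpy as np

# One-dimensional sketch of the fragmenting proposition (function and
# pieces are made up): minimize f(x) = (x - 1)^2 over the nonconvex set
# F = [-3, -1] U [0, 2] U [3, 4] by minimizing over each convex piece.
f = lambda x: (x - 1.0)**2
pieces = [(-3.0, -1.0), (0.0, 2.0), (3.0, 4.0)]

p_stars = [np.min(f(np.linspace(lo, hi, 10001))) for lo, hi in pieces]
p_star = min(p_stars)  # p* = min of the piecewise optimal values

assert abs(p_star) < 1e-7  # minimizer x = 1 lies in the piece [0, 2]
print("piecewise optimal values:", p_stars, "-> p* =", p_star)
```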
    Basics of Convex Optimization
    Convex Optimization Problems Abstract convex optimization problem: A problem involving minimizing a convex objective function on a convex set.
    Convex optimization problem: a problem of the form

    \text{(COP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0, i = 1,\ldots,m\\ & a_i^T x = b_i, i =1,\ldots, p \end{cases},

    where
    f_0:\mathbb{R}^n \to \mathbb{R} and f_i:\mathbb{R}^n \to \mathbb{R} are convex; and
    a_i \in \mathbb{R}^n and b_i \in \mathbb{R} are fixed.


    Some Remarks
    Remark 1. As defined, a (COP) is an (OP) in standard form; naturally, there are nonstandard form (OP)’s equivalent to (COP)’s.
    E.g., the abstract (COP)

    \begin{cases} \text{minimize} & f_0(x,y) \\ \text{subject to} & (x+y+1)^2 = 0 \end{cases}

    is readily seen to be equivalent to the standard form (COP)

    \begin{cases} \text{minimize} & f_0(x,y) \\ \text{subject to}& \begin{bmatrix}1&1\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix} + 1 = 0 \end{cases} .



    Remark 2. We emphasize: the equality constraints are assumed to be affine constraints.
    Moreover, the equality constraints

    a_i^T x = b_i , \quad i=1,\ldots,p

    can be rewritten as

    A x = b ,

    where

    \begin{aligned} A &= \begin{bmatrix} a_1^T\\a_2^T\\\vdots\\a_p^T\end{bmatrix} \in \mathbb{R}^{p \times n}, \quad b = \begin{bmatrix}b_1\\b_2\\\vdots\\b_p\end{bmatrix} \in \mathbb{R}^{p} \end{aligned} .



    Remark 3. The affine assumption on the equality constraints can be lifted at the possible expense of an intractable theory/numerical analysis.
    E.g., if h: \mathbb{R}^n \to \mathbb{R} is quasilinear, then h(x) = 0 defines a convex set.


    Remark 4. Generally, h(x) convex does not imply the level set h(x) = 0 is convex; e.g., h(x,y)=x^2+y^2-1 gives a circle, which is not convex.


    Remark 5. The common domain

    D = \bigcap_{i=0}^m \text{dom}\, f_i

    is convex since it is an intersection of convex sets.
    Optimality for Convex Optimization Problems Assume throughout that f_0:\mathbb{R}^n \to \mathbb{R} is the objective function for some given (COP) and that F is the feasible set.

    Proposition 1. If x^\star is a feasible local minimizer for a (COP), then it is a global minimizer for the (COP).
    Proof. We will follow a proof by contradiction; i.e., we will show that assuming x^\star is not a global minimizer leads to a contradiction.
    Step 1. x^\star being a feasible local minimizer means x^\star \in F and that there is a R>0 such that

    f_0(x^\star) = \inf\{ f_0(z) : z \in F, \quad \Vert z-x^\star \Vert_2 \leq R \} ;

    i.e., f_0(x^\star) \leq f_0(z) for all z \in F within distance R of x^\star .
    Step 2. Supposing x^\star is not a global minimizer, then there exists y \in F such that f_0(y) < f_0(x^\star) .
    By choice of R , there must also hold \Vert{y-x^\star}\Vert_2 > R .
    Step 3. Set

    \begin{aligned} z &= (1-t)x^\star + ty\\ t &= \frac{R}{2\Vert y-x^\star \Vert_2}, \end{aligned}

    noting that t \in (0,\tfrac{1}{2}) \subset [0,1] by Step 2. and so z \in F since F is convex.
    It follows that

    \begin{aligned} \Vert z - x^\star \Vert_2 &= \Vert (1-t)x^\star + ty - x^\star \Vert_2\\ &= \Vert t(y-x^\star ) \Vert_2\\ &= \frac{R}{2\Vert y-x^\star\Vert_2} \Vert y-x^\star\Vert_2\\ &= \frac{R}{2}\\ &< R \end{aligned}

    Step 4. Since z is a convex combination of feasible points, since f_0 is convex and since f_0(y)<f_0(x^\star) , there holds

    \begin{aligned}  f_0(z) &= f_0((1-t)x^\star + ty) \\ &\leq (1-t)f_0(x^\star) + t f_0(y)\\ & < (1-t) f_0(x^\star) + t f_0(x^\star)\\ &= f_0(x^\star). \end{aligned}

    But, since x^\star minimizes f_0 on

    \{x \in F : \Vert x - x^\star \Vert_2 \leq R \}

    and since

    \Vert z - x^\star \Vert_2 \leq R

    we also have

    f_0(x^\star) \leq f_0(z).

    This is a contradiction and so x^\star must be a global minimizer.
    Proposition 2. If f_0 is differentiable on F , then x^\star \in F is a minimizer iff for all y \in F there holds

    \nabla f_0(x^\star)^T (y - x^\star) \geq 0 .

    Proof.
    Step 0. N.B.: since f_0 is differentiable and convex on F , then for each x,y \in F there holds

    f_0(y) \geq f_0(x) + \nabla f_0(x)^T (y-x) .

    (C.f., Convex Function Theory.First Order Characterization.)
    Step 1.(\implies ) Suppose x^\star is a minimizer and suppose for contradiction that

    \nabla f_0(x^\star)^T(y-x^\star)<0

    for some y \in F .
    Set z_t = ty+(1-t)x^\star , noting that z_t \in F since F is convex.
    Using

    \frac{d}{dt} f_0(z_t)|_{t=0} = \nabla f_0(x^\star)^T (y-x^\star) < 0 ,

    we conclude f_0 is decreasing near z_0 = x^\star in the direction y-x^\star and so f_0(z_t)< f_0(x^\star) for small t .
    Since z_t \in F , this contradicts x^\star being a minimizer.
    Additional justification Since z_t defines a line passing through x^\star with direction y-x^\star , it follows that \frac{d}{dt} f_0(z_t)|_{t=0} is the directional derivative in direction (y-x^\star), i.e., \nabla f_0(x^\star)^T(y-x^\star).
    Step 2.(\impliedby ) Supposing

    \nabla f_0(x^\star)^T(y-x^\star) \geq 0

    for all y \in F and using the first order characterization at x^\star , namely,

    f_0(y) \geq f_0(x^\star) + \nabla f_0(x^\star)^T(y-x^\star)

    we readily conclude

    f_0(y) \geq f_0(x^\star)

    for all y \in F ; i.e., that x^\star is a minimizer for the problem.
    Corollary In case f_0 is differentiable and F = \text{dom}\,f_0 (equivalently, there are no nontrivial constraints), x^\star \in \text{dom}\,f_0 is a minimizer iff

    \nabla f_0(x^\star) = 0 .

    Proof. By Proposition 2., we have that x^\star \in \text{dom}\,f_0 is a minimizer iff

    \nabla f_0(x^\star)^T(y-x^\star) \geq 0

    for all y \in \text{dom}\, f_0 .
    Differentiability of f_0 requires \text{dom}\,f_0 is open and so, for small t \in \mathbb{R} , there holds

    y: = x^\star - t \nabla f_0(x^\star) \in \text{dom}\,f_0 .

    But then

    \begin{aligned} \nabla f_0(x^\star)^T(y-x^\star) &= \nabla f_0(x^\star)^T(x^\star - t \nabla f_0(x^\star) - x^\star)\\ &= -t\nabla f_0(x^\star)^T\nabla f_0(x^\star)\\ &= -t \Vert \nabla f_0(x^\star) \Vert_2^2\\ &\geq 0, \end{aligned}

    which, for t>0 , holds iff \nabla f_0(x^\star) = 0 .
    Some Examples
    Example 1. Let

    \begin{aligned} Q \in &\boldsymbol{S}_+^n, \quad a \in \mathbb{R}^n,\quad b \in \mathbb{R}\\ f_0(x)&= \frac{1}{2} x^T Q x + a^T x + b. \end{aligned}

    Consider the unconstrained problem:

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x) = \frac{1}{2} x^T Q x + a^T x + b\\ \text{subject to }& x \in \mathbb{R}^n \end{cases} .

    Note that \nabla^2 f_0 = Q \succeq 0 implies f_0 is convex.
    C.f.,Convex Function Theory.Second Order Characterization. By the preceding corollary, we have x^\star is a solution to (OP) iff

    \nabla f_0(x^\star) = Qx^\star + a = 0.

    Thus solvability of (OP) rests on whether -a \in \text{range}\, Q.

    Three cases:
    1. -a \notin \text{range}\,Q \implies f_0 is unbounded below and hence (OP) is unsolvable;
    2. Q \succ 0 \implies Q is invertible and so x^\star = -Q^{-1}a is the unique solution to (OP);
    3. Q \not\succ 0 and -a \in \text{range}\,Q \implies Qx^\star = -a has an affine set of solutions.
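    The three cases can be checked numerically. The following is a minimal sketch (assuming NumPy, with hypothetical Q and a) that uses a least-squares solve to test whether -a \in \text{range}\,Q:

    ```python
    import numpy as np

    def solve_unconstrained_qp(Q, a):
        """Solve min (1/2) x^T Q x + a^T x + b for symmetric Q >= 0.

        Returns one minimizer if -a is in range(Q), else None (f_0 unbounded below).
        """
        # lstsq returns a least-squares solution of Q x = -a ...
        x, *_ = np.linalg.lstsq(Q, -a, rcond=None)
        # ... which solves Q x = -a exactly iff -a lies in range(Q).
        if np.allclose(Q @ x, -a):
            return x
        return None  # -a not in range(Q): case 1, unbounded below

    # Case 2: Q > 0, unique solution x* = -Q^{-1} a.
    Q = np.array([[2.0, 0.0], [0.0, 1.0]])
    a = np.array([-2.0, 1.0])
    x_star = solve_unconstrained_qp(Q, a)
    assert np.allclose(Q @ x_star + a, 0.0)   # gradient vanishes at x*

    # Case 1: Q singular and -a not in range(Q).
    Q_sing = np.array([[1.0, 0.0], [0.0, 0.0]])
    a_bad = np.array([0.0, 1.0])
    assert solve_unconstrained_qp(Q_sing, a_bad) is None
    ```

    In case 3. the same `lstsq` call returns one particular solution of Qx^\star=-a; the full solution set is that point plus \text{null}\,Q.
    
    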


    Example 2. Let

    \begin{aligned} f_0:\mathbb{R}^n \to \mathbb{R}&  \text{ be convex and differentiable}\\ A &\in \mathbb{R}^{m \times n}, \quad b \in \mathbb{R}^{m} \end{aligned}

    and consider the problem

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & Ax = b \end{cases} .

    Using a preceding proposition, x^\star satisfying Ax^\star = b is a minimizer iff

    \nabla f_0(x^\star)^T(y-x^\star) \geq0

    for all y satisfying Ay= b.


    Two cases
    1. Ax=b is an inconsistent system \implies the problem is infeasible.
    2. Ax=b is a consistent system; then

      Ax^\star = b, Ay=b

      \iff

      y = x^\star + x for some x \in \text{null}A .



    In case 2., we have

    \nabla f_0(x^\star)^T(y-x^\star) = \nabla f_0(x^\star)^Tx \geq 0

    for all

    x \in \text{null}A and y = x^\star + x .

    Since \text{null}\,A is a linear space (so x \in \text{null}\,A \implies -x \in \text{null}\,A ), this is possible iff

    \nabla f_0(x^\star)^Tx = 0 for all x \in \text{null}\,A ,

    i.e., iff

    \nabla f_0(x^\star) \perp \text{null}A.

    But (\text{null}\,A)^\perp = \text{range}\,A^T and so this condition means there exists \nu \in \mathbb{R}^m such that

    \nabla f_0(x^\star) + A^T \nu = 0 .

    This is just a Lagrange multiplier condition, as we will see later.
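    For a quadratic objective as in Example 1. (a hypothetical choice here, with Q \succ 0), the two conditions Ax^\star = b and \nabla f_0(x^\star) + A^T\nu = 0 form one linear "KKT" system, which the sketch below (assuming NumPy and made-up data) solves directly:

    ```python
    import numpy as np

    # Hypothetical data: f_0(x) = 1/2 x^T Q x + a^T x with Q > 0, constraint A x = b.
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    a = np.array([1.0, -1.0])
    A = np.array([[1.0, 1.0]])     # single equality constraint: x_1 + x_2 = 1
    b = np.array([1.0])

    n, m = Q.shape[0], A.shape[0]
    # Stack the optimality conditions  Q x + a + A^T nu = 0  and  A x = b
    # into one linear system in the unknowns (x, nu).
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-a, b])
    sol = np.linalg.solve(K, rhs)
    x_star, nu = sol[:n], sol[n:]

    # x* is feasible and grad f_0(x*) = -A^T nu, i.e., grad f_0(x*) ⟂ null(A).
    assert np.allclose(A @ x_star, b)
    assert np.allclose(Q @ x_star + a + A.T @ nu, 0.0)
    ```
    
    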
    Linear Programming Linear program: a (COP) of the form

    \text{(LP)} \begin{cases} \text{minimize} & c^T x \\ \text{subject to} & Gx \preceq h\\ & Ax= b \end{cases} ,

    where

    \begin{aligned} &G \in \mathbb{R}^{m \times n}, \quad A \in \mathbb{R}^{p\times n}\\ &h \in \mathbb{R}^m, \quad b \in \mathbb{R}^p, \quad x \in \mathbb{R}^n. \end{aligned}

    The feasible set F is a polyhedron (see below).

    Recall (\preceq ): For a,b \in \mathbb{R}^n the vector inequality

    a \preceq b

    means

    a_1 \leq b_1, \, a_2 \leq b_2, \, \ldots, \, a_n \leq b_n .



    Different than: A,B \in \mathbb{R}^{n \times n} satisfying the matrix inequality A \preceq B , which means B - A is positive semidefinite.
    Determining the Feasible set:
    Step 1. (A x = b ) Given A \in \mathbb{R}^{p \times n} , b \in \mathbb{R}^p , then

    \{x: Ax=b \}

    is an affine subspace of \mathbb{R}^n or empty.


    Step 2. Given \gamma \in \mathbb{R}^n, \eta \in \mathbb{R}, then

    \{x:\gamma^Tx \leq \eta \}

    is a half space in \mathbb{R}^n .


    Step 3. (G x \preceq h ) Given

    g_i \in \mathbb{R}^n, \quad G = \begin{bmatrix} g_1^T \\ g_2^T \\ \vdots \\ g_m^T \end{bmatrix} \in \mathbb{R}^{m\times n}, \quad Gx = \begin{bmatrix} g_1^Tx\\g_2^Tx\\ \vdots \\ g_m^Tx \end{bmatrix}

    Step 2. implies

    \{x : G x \preceq h \} = \{x : g_i^Tx\leq h_i, i = 1,\ldots,m\}

    is a finite intersection of half spaces.


    Step 4. Steps 1. and 3. imply the feasible set F of (LP) is the finite intersection of half spaces and an affine space, i.e., F is a polyhedron.
    (c.f. Convex Geometry.Polyhedra.)
    Example. Let

    \begin{aligned} c&=\begin{bmatrix}0.1\\1\end{bmatrix},\, G= \begin{bmatrix} 0.8&0.8\\ -1&-1\\ 0&-1\\ 0&1 \end{bmatrix},\, h= \begin{bmatrix} 4\\-3\\-1\\3 \end{bmatrix} \end{aligned}.

    Consider the resulting (LP):

    \text{(LP)} \begin{cases} \text{minimize} & c^T x \\ \text{subject to} & Gx\preceq h \end{cases}.

    Explicitly, this (LP) is given by

    \text{(LP)} \begin{cases} \text{minimize} & 0.1x_1 + x_2\\ \text{subject to} & 0.8x_1 + 0.8x_2 \leq 4\\ & -x_1-x_2 \leq -3\\ &-x_2\leq -1\\ &x_2\leq3 \end{cases}.

    The feasible set F and the graph of the objective function over F are indicated in the image below.
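    This small (LP) can also be solved numerically; a minimal sketch assuming SciPy's linprog (the bounds argument frees both variables, which linprog otherwise constrains to be nonnegative):

    ```python
    import numpy as np
    from scipy.optimize import linprog

    c = np.array([0.1, 1.0])
    G = np.array([[0.8, 0.8],
                  [-1.0, -1.0],
                  [0.0, -1.0],
                  [0.0, 1.0]])
    h = np.array([4.0, -3.0, -1.0, 3.0])

    # linprog solves min c^T x subject to A_ub x <= b_ub.
    res = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * 2)
    assert res.success
    print(res.x, res.fun)   # optimal vertex (2, 1) with value 1.2
    ```

    The optimum sits at the vertex where the constraints x_1+x_2 \geq 3 and x_2 \geq 1 are both active, as the picture of F suggests.
    
    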
    Remarks.
    1. Given d \in \mathbb{R} , have equivalent problem with objective c^Tx + d .
      Indeed,

      \begin{aligned} \min\{c^Tx + d :x \in F\} &= \min\{ c^Tx: x \in F \} + d\\ \text{argmin}\{c^Tx + d :x \in F\} &=\text{argmin}\{ c^Tx: x \in F \}. \end{aligned}

      WLOG: can assume d=0 to solve problem.


    2. Since

      \begin{aligned} \max\{c^Tx :x \in F\} &= -\min\{ -c^Tx: x \in F \}\\ \text{argmax}\{c^Tx :x \in F\} &=\text{argmin}\{ -c^Tx: x \in F \}. \end{aligned}

      one also calls the problem of maximizing c^Tx over a polyhedron a (LP).
    Example: Integer Linear Programming Relaxation. Integer Linear Program: an (OP) of the form

    \text{(ILP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ & x \in \mathbb{Z}^n \end{cases}

    where

    \begin{aligned} &G \in \mathbb{R}^{m \times n},\qquad h \in \mathbb{R}^m\\ \mathbb{Z}^n : &= \{x \in \mathbb{R}^n : x_i \text{ an integer for each } i=1,\ldots,n\} \end{aligned}

    The constraint x \in \mathbb{Z}^n is suitable for parameters which take on discrete quantities.
    N.B.: An (ILP) is not a convex problem, but may be approximated by one (see below).

    The feasible set Let

    F = \{x \in \mathbb{Z}^n: Gx \preceq h \}

    denote the feasible set of an (ILP).
    Then F is just the collection of integer vectors in the polyhedron

    P = \{ x \in \mathbb{R}^n : Gx \preceq h \} .



    Example. Consider an (ILP) with constraints given by (x,y) \in \mathbb{Z}^2 satisfying

    \begin{aligned} &-x\leq0\\ &-y\leq0\\ &y-2x\leq1\\ &y\leq2.5\\ &y+0.9x\leq4 \end{aligned}.

    The image below depicts the feasible set F (the collection of dots) and polyhedron P (shaded region).
    Remarks.
    1. Can of course also impose equality constraints Ax=b :

      \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ & Ax = b\\ & x \in \mathbb{Z}^n\\ \end{cases}

    2. If we impose x_i \in \{0,1\} , then the problem is called a boolean linear program:

      \text{(BLP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ & x_i \in \{0,1\} \end{cases}

      Suitable for when coordinates of x indicate whether something is “off” or “on” or a decision is “no” or “yes”.
      Can also use x_i \in \{-1,1\} instead.
    Relaxation of (ILP) The LP

    \text{(LP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h \end{cases}

    is called a relaxation of the (ILP) and is a convex approximation.
    Important points:
    1. The “tightest” convex relaxation is given by

      F=\{ x \in \mathbb{Z}^n : Gx \preceq h \} \to \text{convex hull of } F.

      Generally speaking, finding the convex hull is not an efficient way of approaching (ILPs).
    2. The relaxation (LP) of (ILP) is generally easier to solve, though exact algorithms exist for (ILP).
    3. If

      \begin{aligned} p_{LP}^\star &= \text{ optimal value for (LP)}\\ p_{ILP}^\star &= \text{ optimal value for (ILP)} \end{aligned}

      then p_{LP}^\star \leq p_{ILP}^\star.
      (Indeed, the relaxation has a larger feasible set.)
    4. If x^\star \in \mathbb{Z}^n solves the (LP), then it solves the (ILP).
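    Point 3. can be illustrated on the polyhedron from the (ILP) example above; a sketch assuming SciPy's linprog and a hypothetical cost vector c = (-1,-1) (i.e., maximizing x+y), with the (ILP) solved by brute force over a bounding box:

    ```python
    import itertools
    import numpy as np
    from scipy.optimize import linprog

    # Polyhedron P = {(x, y) : G z <= h} from the (ILP) example above.
    G = np.array([[-1.0, 0.0],
                  [0.0, -1.0],
                  [-2.0, 1.0],
                  [0.0, 1.0],
                  [0.9, 1.0]])
    h = np.array([0.0, 0.0, 1.0, 2.5, 4.0])
    c = np.array([-1.0, -1.0])   # hypothetical objective: maximize x + y

    # Optimal value of the LP relaxation.
    res = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * 2)
    p_lp = res.fun

    # Brute force over the integer points in a bounding box (fine for this tiny P).
    feasible = [z for z in itertools.product(range(6), repeat=2)
                if np.all(G @ np.array(z, float) <= h + 1e-9)]
    p_ilp = min(c @ np.array(z, float) for z in feasible)

    assert p_lp <= p_ilp + 1e-9   # the relaxation lower-bounds the (ILP)
    ```
    
    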
    Relaxation of (BLP) Explicitly, a (BLP) is of the form

    \text{(BLP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ &x_i \in \{0,1\}, i=1,\ldots,n \end{cases}.

    The (LP)

    \text{(LP)} \begin{cases} \text{minimize} & c^T x\\ \text{subject to}& Gx \preceq h\\ &0 \leq x_i \leq 1, i=1,\ldots,n \end{cases}

    is called a relaxation of the (BLP).
    N.B.: the relaxation

    F \to \{x \in [0,1]^n: Gx \preceq h \}

    generally provides a better approximation than the relaxation

    F \to \{x \in \mathbb{R}^n: Gx \preceq h \} .

    Indeed, the former biases approximate solutions to be close to being binary.
    Example. Problem Given m workers and n locations with m \leq n ,
    • assign each worker to work at some location
    • assign at most one worker to a location
    • minimize cost of operation and transportation
    Notational set up

    \begin{aligned} c_j &= \text{cost to operate at location }j\\ c_{ij} &= \text{cost to transport worker } i \text{ to location }j\\ x_j &=  \begin{cases} 0 & \text{ if location }j\text{ is not operating}\\ 1 & \text{ if location }j\text{ is operating} \end{cases}\\ x_{ij} &=  \begin{cases} 0 & \text{ if worker }i \text{ is not working at location }j\\ 1 & \text{ if worker }i \text{ is working at location }j \end{cases}. \end{aligned}

    Let

    \begin{aligned} \text{Cost vectors} & \begin{cases} c &= (c_j) \in \mathbb{R}^n\\ C &= (c_{ij}) \in \mathbb{R}^{m \times n} \end{cases}\\ \text{Optimization variables} &\begin{cases} x &= (x_j) \in \{0,1\}^n\\ X &= (x_{ij}) \in \{0,1\}^{m \times n} \end{cases} \end{aligned}.



    Construct objective function
    We find

    \begin{aligned} \text{total operational cost }&= c^Tx = c_1x_1 + \cdots +c_nx_n\\ \text{total transportation cost }&= \text{tr}\,(C^TX) = \sum_{i=1}^m \sum_{j=1}^n c_{ij}x_{ij}\\ \text{total cost }&= c^Tx + \text{tr}\,(C^TX). \end{aligned}

    Thus the objective function is f_0(x,X) = c^Tx + \text{tr}\,(C^TX).

    Construct constraints
    The constraint that x_j,x_{ij} are binary is of course natural for the problem.
    Since each worker is assigned to exactly one location, we have

    \sum_{j=1}^n x_{ij} = 1, \quad \text{ for each } i=1,\ldots,m.

    Next, observe x_j=0 \implies x_{ij}=0 since the j th location not operating means it cannot host a worker; thus we have x_{ij} \leq x_j .
    Lastly, since at most one worker may be assigned to a location, we have

    \sum_{i=1}^m x_{ij} \leq 1, \quad \text{ for each } j=1,\ldots,n.

    Formulate Problem
    Putting everything together, the (BLP) formulation of the problem is

    \begin{cases} \text{minimize} & c^Tx + \text{tr}\,(C^TX)\\ \text{subject to} & x \in \{0,1\}^n\\ &X \in \{0,1\}^{m \times n}\\ &\sum_{j=1}^n x_{ij} =1, i = 1,\ldots,m\\ &\sum_{i=1}^m x_{ij} \leq 1, j = 1,\ldots,n\\ & x_{ij} \leq x_j, i=1,\ldots,m,j=1,\ldots,n \end{cases}.

    An (LP) relaxation (and hence convex approximation) of this (BLP) is

    \begin{cases} \text{minimize} & c^Tx + \text{tr}\,(C^TX)\\ \text{subject to} & x \in [0,1]^n\\ &X \in [0,1]^{m \times n}\\ &\sum_{j=1}^n x_{ij} =1, i = 1,\ldots,m\\ &\sum_{i=1}^m x_{ij} \leq 1, j = 1,\ldots,n\\ & x_{ij} \leq x_j, i=1,\ldots,m,j=1,\ldots,n \end{cases}.
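    For a tiny hypothetical instance (m = 2 workers, n = 3 locations, made-up cost data), the (BLP) can be solved by brute-force enumeration, which gives a ground truth to compare relaxations against. The constraint that at most one worker sits at a location is enforced by requiring distinct assignments:

    ```python
    import itertools
    import numpy as np

    # Hypothetical tiny instance: m = 2 workers, n = 3 locations.
    c = np.array([4.0, 1.0, 2.0])              # c_j: cost to operate location j
    C = np.array([[1.0, 3.0, 2.0],             # c_ij: transport worker i -> location j
                  [2.0, 1.0, 4.0]])
    m, n = C.shape

    best = None
    # Enumerate all binary (x, X); brute force is fine at this size.
    for x in itertools.product((0, 1), repeat=n):
        for rows in itertools.product(range(n), repeat=m):
            if len(set(rows)) < m:             # at most one worker per location
                continue
            X = np.zeros((m, n))
            for i, j in enumerate(rows):
                X[i, j] = 1.0                  # worker i works at location j
            if np.any(X > np.array(x)):        # enforce x_ij <= x_j
                continue
            cost = c @ np.array(x, float) + np.trace(C.T @ X)
            if best is None or cost < best[0]:
                best = (cost, x, rows)

    print(best)   # (cost, opened locations, assignment)
    ```
    
    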



    Quadratic Programming Quadratic program: an (OP) of the form

    \text{(QP)} \begin{cases} \text{minimize}&\frac{1}{2}x^TQx + q^Tx\\ \text{subject to} &Gx \preceq h\\ &Ax =b \end{cases}

    where

    \begin{aligned} &Q \in \boldsymbol{S}^n, \quad G \in \mathbb{R}^{m \times n}, \quad A \in \mathbb{R}^{p \times n}\\ &q \in \mathbb{R}^n, \quad h \in \mathbb{R}^m,\quad  b \in \mathbb{R}^p. \end{aligned}

    If Q \in \boldsymbol{S}_+^n , then the problem is convex (see remarks).

    Remarks.
    1. As for (LP), the constraints

      \begin{cases} Gx \preceq h\\ Ax =b \end{cases}

      describe a polyhedron.
    2. Given d \in \mathbb{R} , then

      \begin{aligned} \min\{\frac{1}{2}x^TQx + q^Tx + d :x \in F\} &= \min\{ \frac{1}{2}x^TQx + q^Tx: x \in F \} + d\\ \text{argmin}\{\frac{1}{2}x^TQx + q^Tx + d :x \in F\} &= \text{argmin}\{ \frac{1}{2}x^TQx + q^Tx: x \in F \} \\ \end{aligned}

      WLOG: can assume d=0 to solve problem.
    3. If

      f_0(x) = \frac{1}{2}x^TQx + q^Tx ,

      then

      \begin{aligned} \nabla f_0(x) &= Qx + q\\ \nabla^2 f_0(x) &= Q. \end{aligned}

      Thus Q \succeq 0 implies f_0 is convex.
      N.B.: The factor \frac{1}{2} is just a convenient normalization.
    4. The generalization

      \begin{aligned} &Gx \preceq h \, \to \, \frac{1}{2} x^T Q_ix+g_i^Tx \leq h_i \\ &Q_i \in \mathbb{R}^{n\times n}, g_i \in \mathbb{R}^n, h_i \in \mathbb{R}, \quad i =1,\ldots,m \\ \end{aligned}

      results in quadratically constrained quadratic programming (QCQP).
      Imposing Q,Q_i \in \boldsymbol{S}_+^n ensures the (QCQP) is convex.
      (C.f., Convex Function Theory.Level Sets.)
    5. The generalization

      \begin{aligned} &Ax=b \, \to \, \frac{1}{2} x^T P_ix+a_i^Tx =b_i  \\ &P_i \in \mathbb{R}^{n\times n}, a_i \in \mathbb{R}^n, b_i \in \mathbb{R}, \quad i =1,\ldots,m \\ \end{aligned}

      also results in a (QCQP), but this can break convexity.
      E.g., The quadratic constraints

      x_i(x_i-1)=0 \text{ or } x_i^2 = 1

      result in a (BLP) since these constraints enforce x_i \in \{0,1\} or x_i \in \{-1,1\}, respectively.
    6. There holds

      \begin{aligned} \text{(QCQP)} + Q_i = P_i= 0  &\iff (QP)\\ (QP) + Q=0 &\iff (LP). \end{aligned}


    7. The assumption Q \in \boldsymbol{S}^n (i.e., Q = Q^T ) is not a serious restriction.
      Indeed, first note that, since x^TQx is a scalar, we have

      x^TQx = (x^TQx)^T = x^TQ^Tx .



      Let

      f(x) = \frac{1}{2}x^TQx + q^T x .



      Thus,

      \begin{aligned} f(x) &= \frac{f(x) + f(x)}{2}\\ &= \frac{1}{2}\left( \frac{1}{2}x^T Q x + q^Tx + \frac{1}{2}x^T Q^T x + q^Tx \right)\\ &=\frac{1}{2}\left( \frac{1}{2}x^T(Q+Q^T)x + 2 q^Tx \right)\\ &=:\frac{1}{2}x^T \tilde{Q} x + q^T x, \end{aligned}

      where

      \tilde{Q} = \frac{1}{2}(Q+Q^T) .



      Lastly, observe the symmetry of \tilde{Q} :

      \tilde{Q}^T = \left( \frac{1}{2}(Q+Q^T) \right)^T = \frac{1}{2}(Q^T+Q) = \tilde{Q}.



      Thus every quadratic

      f(x) = \frac{1}{2}x^TQx + q^Tx

      has a “symmetric representation” of the form

      f(x) = \frac{1}{2}x^T\tilde{Q}x + q^Tx with \tilde{Q} \in \boldsymbol{S}^n.
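    The symmetric representation is easy to verify numerically; a short NumPy check with a random nonsymmetric Q:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    Q = rng.standard_normal((4, 4))        # generic, non-symmetric Q
    q = rng.standard_normal(4)
    Q_tilde = 0.5 * (Q + Q.T)              # symmetric representation

    assert np.allclose(Q_tilde, Q_tilde.T)
    for _ in range(5):
        x = rng.standard_normal(4)
        # the quadratic form is unchanged by symmetrization
        f = 0.5 * x @ Q @ x + q @ x
        f_sym = 0.5 * x @ Q_tilde @ x + q @ x
        assert np.isclose(f, f_sym)
    ```
    
    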

    Example: Least Squares Least squares: an unconstrained (QP) of the form

    \text{(LS)} \begin{cases} \text{minimize}&\Vert Ax-b\Vert_2^2= x^TA^TAx-2b^TAx + b^Tb \end{cases}

    where A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m .
    Features:
    • WLOG: may assume columns of A are linearly independent, and so

      m \geq n .

    • N.B.: the least-norm solution to (LS) is in general given by

      x^\star = A^{\dagger}b ,

      where A^\dagger is the pseudo-inverse (aka Moore-Penrose inverse) of A .
    • Under the WLOG assumption, there holds

      x^\star = (A^TA)^{-1}A^Tb .



    Recall (definition of \Vert \cdot \Vert_2 ) For x \in \mathbb{R}^n , the notation \Vert x \Vert_2 means the vector norm

    \Vert x \Vert_2 := \sqrt{x_1^2 + \cdots + x_n^2 } .

    Thus

    \Vert x- y \Vert_2

    is the distance between the two vectors x,y \in \mathbb{R}^n .
    Rough Justification of WLOG Let

    \begin{aligned} A  &= \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \in \mathbb{R}^{2 \times 3}, \quad a_1,a_2,a_3 \in \mathbb{R}^2 \\ x &=  \begin{bmatrix}  x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad b =  \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}. \end{aligned}

    N.B.: a set of three 2-dimensional vectors is always linearly dependent and so

    a_3 = c_1 a_1 + c_2 a_2

    for some c_1,c_2 \in \mathbb{R}
    Therefore, we may write

    A=: \begin{bmatrix} a & \alpha & u \\ b & \beta & v  \end{bmatrix} = \begin{bmatrix} a & \alpha & c_1 a + c_2 \alpha\\ b & \beta & c_1 b + c_2 \beta \end{bmatrix} .



    Computing

    \begin{aligned} Ax &= \begin{bmatrix} a & \alpha & c_1 a + c_2 \alpha\\ b & \beta & c_1 b + c_2 \beta \end{bmatrix} \begin{bmatrix}  x_1 \\ x_2 \\ x_3 \end{bmatrix}\\ &= \begin{bmatrix} ax_1 + \alpha x_2 + c_1 a x_3 + c_2 \alpha x_3\\ bx_1 + \beta x_2 + c_1 b x_3 + c_2 \beta x_3 \end{bmatrix}\\ &= \begin{bmatrix} a(x_1 + c_1 x_3) + \alpha (x_2 + c_2 x_3)\\ b(x_1 + c_1 x_3) + \beta(x_2 + c_2 x_3) \end{bmatrix} \end{aligned}

    we see that minimizing

    \Vert Ax-b \Vert_2^2 = \bigg\Vert \begin{bmatrix} a(x_1 + c_1 x_3) + \alpha (x_2 + c_2 x_3)\\ b(x_1 + c_1 x_3) + \beta(x_2 + c_2 x_3) \end{bmatrix} - \begin{bmatrix} b_1\\b_2 \end{bmatrix} \bigg\Vert_2^2

    is equivalent to minimizing

    \begin{aligned} \Vert \hat{A}y - b \Vert_2^2 &:= \bigg\Vert \begin{bmatrix} a&\alpha\\b&\beta \end{bmatrix} \begin{bmatrix}y_1\\y_2\end{bmatrix}-\begin{bmatrix}b_1\\b_2\end{bmatrix} \bigg\Vert_2^2\\ &= \bigg\Vert \begin{bmatrix} ay_1 + \alpha y_2\\ by_1 + \beta y_2 \end{bmatrix} -\begin{bmatrix}b_1\\b_2\end{bmatrix} \bigg\Vert_2^2 \end{aligned}.

    Equivalence follows from the change of variables

    \begin{aligned} y_1 &= x_1 + c_1 x_3\\ y_2 &= x_2 + c_2 x_3  \end{aligned} .



    Therefore, a (LS) problem with matrix of size 2 \times 3 is equivalent to a (LS) problem with matrix of size 2 \times 2 .

    This argument holds in general: if A has linearly dependent columns, can use change of variables to ensure linear independence.
    Remarks.
    1. If Ax=b has solution x^\star , then x^\star solves the (LS) problem.

    2. If Ax=b has no solution, then any x^\star solving the (LS) problem gives a “best” estimate solution to Ax = b, where “best” is chosen to mean in terms of the vector norm \Vert \cdot \Vert_2 .

    3. While minimizing \Vert \cdot \Vert_2^2 is equivalent to minimizing \Vert \cdot \Vert_2 , the exponent 2 ensures \Vert \cdot \Vert_2^2 is differentiable everywhere (including at 0 ).
      (C.f.: |x| is not differentiable at x=0 , but x^2 is.)
    Solving the (LS)
    Step 1. (The problem is convex) The objective function

    f_0(x) = \Vert Ax-b\Vert_2^2 = x^TA^TAx - 2b^TAx+b^Tb

    is convex since

    \nabla^2 f_0 = 2 A^TA \succeq 0 .

    To see A^TA \succeq 0 , observe:
    1. (A^T A)^T = A^T A and so A^TA \in \boldsymbol{S}^n , i.e., A^TA is symmetric;
    2. for all z \in \mathbb{R}^n there holds

      z^T A^T A z = (Az)^T Az = \Vert Az \Vert_2^2 \geq 0 ,

      i.e., A^TA \in \boldsymbol{S}_+^n .
    Step 2. (The critical points) We compute

    \begin{aligned} \nabla f_0 &= \nabla (x^T A^TA x) - \nabla(2b^TAx) + \nabla(b^Tb)\\ &= 2 A^TAx - 2A^Tb + 0\\ &= 2(A^TAx - A^Tb). \end{aligned}

    Therefore, \nabla f_0 = 0 iff

    A^TAx = A^T b .

    This system of equations is known as the normal equations.
    Step 3. (The solution) By corollary proved above: x^\star is a solution iff \nabla f_0(x^\star) = 0 .
    Moreover, A having linearly independent columns ensures that A^TA is invertible:
    1. linearly independent columns \implies Az=0 iff z=0.
    2. A^TAz = 0 \iff Az=0 \iff z = 0 .
    3. Since A^TA is square, can conclude A^TA is invertible.


    Lastly: since

    \nabla f_0(x^\star)= 0 iff A^TA x^\star = A^T b

    and since A^TA is invertible, we conclude that the solution to the (LS) is

    x^\star = (A^TA)^{-1}A^Tb .
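    The normal-equation solution can be cross-checked against library least-squares routines; a sketch assuming NumPy and randomly generated (A, b) with m > n:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((8, 3))   # m >= n; columns (almost surely) independent
    b = rng.standard_normal(8)

    # Solve the normal equations A^T A x = A^T b directly ...
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)
    # ... and compare against library least-squares and pseudo-inverse routines.
    x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
    x_pinv = np.linalg.pinv(A) @ b

    assert np.allclose(x_ne, x_lstsq) and np.allclose(x_ne, x_pinv)
    # the gradient 2(A^T A x - A^T b) vanishes at the solution
    assert np.allclose(A.T @ (A @ x_ne - b), 0.0)
    ```

    In practice one solves the normal equations (or better, uses a QR/SVD routine as `lstsq` does) rather than forming (A^TA)^{-1} explicitly.
    
    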

    Example: Distance Between Convex Sets Let K_1,K_2 \subset \mathbb{R}^n be convex subsets.
    Nearest Point Problem (NPP): Among x \in K_1, y \in K_2 , which pair (x,y) minimizes the distance \Vert x-y \Vert_2^2 ?



    The (NPP) may be expressed as a standard form (COP).
    Indeed, supposing

    \begin{aligned} K_1 &= \{ f_1(x) \leq 0 \}\\ K_2 &= \{ f_2(x) \leq 0 \}, \end{aligned}

    with f_1,f_2 convex, then the (NPP) for K_1,K_2 is the (COP)

    \begin{cases} \text{minimize} & \Vert x - y \Vert_2^2\\ \text{subject to} & f_1(x) \leq 0 \\ &f_2(y) \leq 0 \end{cases}.

    N.B.: the feasible set is

    \begin{aligned} F&=\{(x,y) \in \mathbb{R}^n \times \mathbb{R}^n : f_1(x) \leq 0, f_2(y) \leq 0 \}\\ &=K_1 \times K_2. \end{aligned}

    Viz.: (x,y) \in F iff x \in K_1, y \in K_2 .

    When the K_1,K_2 are polyhedra, then the (NPP) is a (QP).
    Indeed, supposing

    \begin{aligned} K_1 &= \{ G_1 x \preceq h_1 \}\\ K_2 &= \{ G_2 x \preceq h_2 \}, \end{aligned}

    then the (NPP) for K_1,K_2 is the (QP) given by

    \begin{cases} \text{minimize} & \Vert x - y \Vert_2^2\\ \text{subject to} & G_1 x \preceq h_1\\ & G_2 y \preceq h_2 \end{cases}.

    Can formulate as constrained least squares problem: defining

    \begin{aligned}  B \in \mathbb{R}^{p \times n},\, C \in \mathbb{R}^{q \times n} &\mapsto B \oplus C =  \begin{bmatrix} B & \vline & 0\\ \hline 0 & \vline & C \end{bmatrix} \in \mathbb{R}^{(p+q) \times 2n}, \\ v \in \mathbb{R}^p,\, w \in \mathbb{R}^q &\mapsto  v \oplus w = \begin{bmatrix} v_1 & \cdots & v_p & w_1 & \cdots & w_q \end{bmatrix}^T \in \mathbb{R}^{p + q}\\ A &= \begin{bmatrix} Id_{n\times n} &\vline & - Id_{n\times n}\\ \hline 0 &\vline& 0 \end{bmatrix} \in \mathbb{R}^{2n\times 2n} \end{aligned}

    we readily see that the problem is equivalent to

    \begin{cases} \text{minimize} & \Vert A(x\oplus y) \Vert_2^2\\ \text{subject to} & (G_1 \oplus G_2) (x \oplus y) \preceq h_1 \oplus h_2 \end{cases}.
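    As an illustration, the (NPP) for two hypothetical boxes can be solved with a general-purpose solver; a sketch assuming SciPy's SLSQP method (a generic local solver, not a dedicated QP code), with each box written as \{z : Gz \preceq h\}:

    ```python
    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical polyhedra: K1 = [0,1]^2 and K2 = [2,3] x [0,1].
    G1 = np.vstack([np.eye(2), -np.eye(2)]); h1 = np.array([1.0, 1.0, 0.0, 0.0])
    G2 = np.vstack([np.eye(2), -np.eye(2)]); h2 = np.array([3.0, 1.0, -2.0, 0.0])

    def objective(z):                    # z = x ⊕ y; squared distance ||x - y||^2
        x, y = z[:2], z[2:]
        return np.sum((x - y) ** 2)

    # SLSQP inequality constraints require fun(z) >= 0, i.e., h - G z >= 0.
    cons = [{"type": "ineq", "fun": lambda z: h1 - G1 @ z[:2]},
            {"type": "ineq", "fun": lambda z: h2 - G2 @ z[2:]}]
    z0 = np.array([0.5, 0.5, 2.5, 0.5])  # feasible starting point
    res = minimize(objective, z0, constraints=cons, method="SLSQP")

    # The boxes are separated by a gap of width 1, so the minimal distance is 1.
    assert np.isclose(res.fun, 1.0, atol=1e-4)
    ```
    
    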



    Polyhedral Approximation. Let K_1,K_2 be nonempty convex domains, not necessarily polyhedral. Let P_1,P_2,Q_1,Q_2 be polyhedral domains such that

    \begin{aligned} &P_1 \subset K_1 \subset Q_1\\ &P_2 \subset K_2 \subset Q_2. \end{aligned}

    Then the (NPP) for the pairs [P_1,P_2] and [Q_1,Q_2] are (QP) approximations of the (NPP) for [K_1,K_2] and provide, respectively, upper and lower bounds on the optimal value of the original (NPP).
    Geometric Programming Monomial function: given

    c > 0,\, a_i \in \mathbb{R},\, i =1,\ldots,n ,

    a function of the form

    \begin{aligned} f&: \mathbb{R}^n \to \mathbb{R}\\ f(x)&= c x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}\\ \text{dom}\,f&=\mathbb{R}_{++}^n. \end{aligned}

    Example.

    3x_1^2x_2^{3.4}x_3^{-\pi}.

    Posynomial: given

    c_k > 0,\, a_{ik}\in \mathbb{R},\, i=1,\ldots,n,\, k=1,\ldots,K ,

    a function of the form

    \begin{aligned} f&: \mathbb{R}^n \to \mathbb{R}\\ f(x)&= \sum_{k=1}^K c_k x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_n^{a_{nk}}\\ \text{dom}\,f&=\mathbb{R}_{++}^n. \end{aligned}

    Example.

    f(x) = \sqrt{2}x_1x_2\sqrt{x_3} + 3 x_1^4x_2^{-4}x_3^{-1.5}+x_5

    Geometric Programming: An (OP) of the form

    \text{(GP)} \begin{cases} \text{minimize}&f_0(x)\\ \text{subject to}&f_i(x) \leq 1, \, i=1,\ldots,m\\ & h_i(x) = 1, \, i=1,\ldots,p \end{cases},

    where

    \begin{aligned} f_0,f_1,\ldots,f_m & \text{ are posynomials}\\ h_1,\ldots,h_p & \text{ are monomials}. \end{aligned}

    N.B.: this is an undercover (COP).
    Remarks.
    1. Let

      \begin{aligned} Posy_n &= \text{ set of posynomials on } \mathbb{R}_{++}^n\\ Mon_n &= \text{ set of monomials on } \mathbb{R}_{++}^n. \end{aligned}

      Since any monomial is a posynomial, we have Mon_n \subset Posy_n .
    2. If

      \begin{aligned}  \lambda &\geq 0\\ p(x),q(x) &\in Posy_n\\ f(x),g(x) &\in Mon_n , \end{aligned}

      then

      \begin{aligned} \lambda p(x),\, p(x) + q(x),\, p(x) q(x),\, \frac{p(x)}{f(x)} \in Posy_n\\ \lambda f(x),\, f(x)g(x),\, \frac{f(x)}{g(x)} \in Mon_n. \end{aligned}

      Thus Mon_n is a group under multiplication, while Posy_n forms a convex cone generated by Mon_n .
    3. As usual, we use the language “writing in standard form” to refer to writing an equivalent (OP) written in the form (GP) above.

      General (OPs) clearly equivalent to a (GP) may be called a geometric program in nonstandard form.

      For example, the geometric program

      \begin{cases} \text{maximize}&f_0(x)\\ \text{subject to} & f_i(x) \leq g_i(x)\\ & h_i(x) = k_i(x) \end{cases}

      with

      \begin{aligned} f_i  &\text{ are posynomials}\\ f_0,h_i,g_i\not\equiv 0 ,k_i\not\equiv0 & \text{ are monomials} \end{aligned}

      is readily rewritten as a standard form (GP):

      \begin{cases} \text{minimize}&\frac{1}{f_0(x)}\\ \text{subject to} & \frac{f_i(x)}{g_i(x)} \leq 1\\ & \frac{h_i(x)}{k_i(x)} = 1 \end{cases}.

    Rewriting (GP) as a (COP) General (GPs) are not convex (e.g., f_0(x) = \sqrt{x} ).
    However, any (GP) is easily recast as a (COP) via change of variable.

    Step 1. (The change of variable) We will write x \mapsto y to mean the change of variable given by

    x_i = e^{y_i}.

    Step 2. (Monomials \to convex function) Let

    \begin{matrix} c>0, \quad b = \log c,& f(x) \in Mon_n,\\  a = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}^T \in \mathbb{R}^n,\quad & f(x) = c x_1^{a_1}x_2^{a_2} \cdots x_n^{a_n}. \end{matrix}

    Under the change of variable x \mapsto y:

    \begin{aligned} f(x) &= f(x_1,\ldots,x_n)\\ &= f(e^{y_1},\ldots, e^{y_n})\\ &= c(e^{y_1})^{a_1} \cdots (e^{y_n})^{a_n}\\ &= e^{\log c}e^{a_1y_1}\cdots e^{a_ny_n}\\ &= e^{a^Ty+b} \end{aligned}

    But

    F\, convex \implies e^F\, convex

    and so e^{a^Ty+b} is convex since affine functions are convex.
    Step 3. (Posynomial \to convex function) For k=1,\ldots,K , let

    \begin{matrix} c_k>0, \quad b_k = \log c_k,& f(x) \in Posy_n,\\  a_k = \begin{bmatrix} a_{1k} & \cdots & a_{nk} \end{bmatrix}^T \in \mathbb{R}^{n},\quad & f(x) = \sum_{k=1}^K c_k x_1^{a_{1k}}\cdots x_{n}^{a_{nk}}. \end{matrix}

    By Step 2., there holds

    \begin{aligned} f(x) = f(y) = \sum_{k=1}^K e^{a_k^Ty + b_k}, \end{aligned}

    which is again a convex function.
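    The identity in Step 3. can be verified numerically for a hypothetical posynomial; a short NumPy sketch with c = (2, 3) and exponent vectors a_1 = (1, 0.5), a_2 = (-1, 2):

    ```python
    import numpy as np

    # Hypothetical posynomial f(x) = 2 x1 x2^0.5 + 3 x1^-1 x2^2 on R_{++}^2.
    c = np.array([2.0, 3.0])
    a = np.array([[1.0, 0.5], [-1.0, 2.0]])   # one exponent vector per row
    b = np.log(c)

    def f(x):
        return sum(ck * np.prod(x ** ak) for ck, ak in zip(c, a))

    rng = np.random.default_rng(2)
    for _ in range(5):
        x = rng.uniform(0.1, 5.0, size=2)
        y = np.log(x)                         # change of variable x_i = e^{y_i}
        # posynomial value equals the sum of exponentials in the y variables
        assert np.isclose(f(x), np.sum(np.exp(a @ y + b)))
    ```
    
    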
    Step 4. ((GP) \to (COP)) We explicitly write the (GP) as:

    \begin{cases} \text{minimize}&f_0(x) = \sum_{k=1}^{K_0} c_{0k} x_1^{a_{01k}}\cdots x_n^{a_{0nk}}\\ \text{subject to}& f_i(x) = \sum_{k=1}^{K_i} c_{ik} x_1^{a_{i1k}} \cdots x_n^{a_{ink}} \leq 1, \, i=1,\ldots,m\\ & h_i(x) =d_{i}x_1^{b_{i1}}\cdots x_n^{b_{in}} = 1, \, i=1,\ldots,p \end{cases}.

    Let

    \begin{aligned} a_{ik} &= \begin{bmatrix} a_{i1k} & \cdots & a_{ink} \end{bmatrix}^T \in \mathbb{R}^n\\ b_i &= \begin{bmatrix} b_{i1} & \cdots & b_{in} \end{bmatrix}^T \in \mathbb{R}^n\\ \alpha_{ik} &= \log c_{ik}\\ \delta_i &= \log d_{i}  \end{aligned}.

    Under the change of variable x \mapsto y , this (GP) becomes the (COP)

    \begin{cases} \text{minimize}&f_0(y) = \sum_{k=1}^{K_0} e^{a_{0k}^T y + \alpha_{0k}}\\ \text{subject to}& f_i(y) = \sum_{k=1}^{K_i}  e^{a_{ik}^T y + \alpha_{ik}} \leq 1, \, i=1,\ldots,m\\ & h_i(y) = e^{b_{i}^Ty + \delta_i} = 1, \, i=1,\ldots,p \end{cases}.

    Step 5. ((GP) in convex form) At last, since exponentiation may result in unreasonably large numbers, it is customary to take logarithms, resulting in the geometric problem in convex form:

    \begin{cases} \text{minimize}& \log \left( \sum_{k=1}^{K_0} e^{a_{0k}^T y + \alpha_{0k}} \right)\\ \text{subject to}& \log \left(\sum_{k=1}^{K_i}  e^{a_{ik}^T y + \alpha_{ik}}\right) \leq 0, \, i=1,\ldots,m\\ & b_{i}^Ty + \delta_i = 0, \, i=1,\ldots,p \end{cases}.

    N.B.:
    1. The log-sum-exp function u \mapsto \log \left( \sum_{k} e^{u_k} \right) is convex, and composing it with the affine maps y \mapsto a_{ik}^Ty + \alpha_{ik} preserves convexity, so the problem is still convex.
    2. The constraints are equivalent since \log is monotonic and injective on \mathbb{R}_{++}.
    Example. (Taken from Boyd-Kim-Vandenberghe-Hassibi: A tutorial on geometric programming)
    Problem Maximize the volume of a box with
    • a limit on total wall area;
    • a limit on total floor area; and
    • upper and lower bounds on the aspect ratios height/width and depth/width .
    Notational set up

    \begin{aligned} \text{Optimization Variables}& \begin{cases} w &= \text{width}\\ h &= \text{height}\\ d &= \text{depth} \end{cases}\\ \text{Problem Parameters}& \begin{cases} A_{\text{wall}} &= \text{max wall area}\\ A_{\text{floor}} &= \text{max floor area}\\ \alpha_{1},\alpha_2 &= \text{lower and upper aspect ratio bounds for }h/w\\ \beta_{1},\beta_2 &= \text{lower and upper aspect ratio bounds for }d/w \end{cases} \end{aligned}

    Construct objective function
    The volume of the box is

    hwd

    and so the objective function is

    f_0(h,w,d) = hwd .

    N.B.: f_0 \in Mon_3



    Construct constraints

    \begin{aligned} \text{wall area bound }:&\quad 2hw+2hd \leq A_{\text{wall}}\\ \text{floor area bound }:&\quad wd \leq A_{\text{floor}}\\ \text{aspect ratio bounds }:&\quad \alpha_1 \leq \frac{h}{w} \leq \alpha_2\\ &\quad \beta_1 \leq \frac{d}{w} \leq \beta_2 \end{aligned}

    N.B.:

    \begin{aligned} 2hw+2hd &\in Posy_3\\ wd,\, hw^{-1},\, dw^{-1} &\in Mon_3 \end{aligned}

    Formulate Problem
    Putting everything together, we realize the problem may be formulated as the following (GP):

    \begin{cases} \text{maximize} & hwd\\ \text{subject to} & 2hw+2hd\leq A_{\text{wall}}\\ & wd \leq A_{\text{floor}}\\ & \alpha_1 \leq hw^{-1} \leq \alpha_2\\ & \beta_1 \leq dw^{-1} \leq \beta_2 \end{cases}.

    To write the problem in standard form: note the following equivalence of constraints

    \begin{aligned} 2hw+2hd\leq A_{\text{wall}} & \iff A_{\text{wall}}^{-1}2hw + A_{\text{wall}}^{-1}2hd \leq 1\\  wd \leq A_{\text{floor}} & \iff A_{\text{floor}}^{-1}wd \leq 1\\ &\\ \alpha_1 \leq hw^{-1} \leq \alpha_2 &\iff \begin{array}{l} \alpha_2^{-1}hw^{-1} \leq 1 \\  \alpha_1 h^{-1}w \leq 1 \end{array}\\ \beta_1 \leq dw^{-1} \leq \beta_2 & \iff \begin{array}{l} \beta_2^{-1} dw^{-1} \leq 1\\ \beta_1 d^{-1}w \leq 1 \end{array} \end{aligned}

    Moreover, maximizing hwd is equivalent to minimizing h^{-1}w^{-1}d^{-1} . Therefore, the problem in standard form is given by

    \begin{cases} \text{minimize} & h^{-1}w^{-1}d^{-1}\\ \text{subject to} & A_{\text{wall}}^{-1}2hw +A_{\text{wall}}^{-1}2hd \leq 1\\ & A_{\text{floor}}^{-1} wd \leq 1\\ &  \alpha_2^{-1} hw^{-1}\leq 1 \\ & \alpha_1 wh^{-1} \leq 1\\ & \beta_2^{-1}dw^{-1} \leq 1\\ & \beta_1 wd^{-1} \leq 1 \end{cases}.
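The constraint rewriting above can be sanity-checked numerically. Below is a small pure-Python sketch (the parameter values and sample points are assumptions, not from the notes) verifying that the original constraints and the standard-form monomial/posynomial constraints accept exactly the same designs:

```python
# Verify (on sample data) that the original box-design constraints and
# the standard-form constraints are equivalent.
A_wall, A_floor = 100.0, 40.0   # assumed parameter values
a1, a2 = 0.5, 2.0               # aspect ratio bounds for h/w
b1, b2 = 0.5, 2.0               # aspect ratio bounds for d/w

def original_feasible(h, w, d):
    return (2*h*w + 2*h*d <= A_wall
            and w*d <= A_floor
            and a1 <= h/w <= a2
            and b1 <= d/w <= b2)

def standard_feasible(h, w, d):
    # every constraint rewritten as (monomial or posynomial) <= 1
    return (2*h*w/A_wall + 2*h*d/A_wall <= 1
            and w*d/A_floor <= 1
            and (h/w)/a2 <= 1
            and a1*(w/h) <= 1
            and (d/w)/b2 <= 1
            and b1*(w/d) <= 1)

for h, w, d in [(2, 3, 4), (5, 5, 5), (1, 10, 1), (4, 4, 8)]:
    assert original_feasible(h, w, d) == standard_feasible(h, w, d)
```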

    Semidefinite Programming (Heavily influenced by Vandenberghe-Boyd Semidefinite Programming.)
    Linear matrix inequality (LMI): given

    \begin{aligned}  &F_0,F_1,\ldots,F_n \in \boldsymbol{S}^m\\ &x = \begin{bmatrix}x_1 & \cdots & x_n \end{bmatrix}^T \in \mathbb{R}^n\\ F(x) &:= F_0 + x_1 F_1 + \cdots + x_n F_n \end{aligned}

    an inequality of the form

    F(x) \succeq 0 .

    Recall: for A \in \boldsymbol{S}^m , we write A \succeq 0 to mean A is positive semidefinite, i.e., z^TAz \geq 0 for all z \in \mathbb{R}^m.

    Semidefinite program (SDP): an (OP) of the form

    \text{(SDP)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & F(x) \succeq 0, \end{cases}

    where

    \begin{aligned}  &F_0,F_1,\ldots,F_n \in \boldsymbol{S}^m\\ F(x) &:= F_0 + x_1 F_1 + \cdots + x_n F_n\\ c & \in \mathbb{R}^n \end{aligned}

    N.B.: The LMI F(x) \succeq 0 defines a feasible set which is convex and hence (SDPs) are convex problems.

    Convexity of Feasible Set. To see that (SDP) is a convex problem, first note: if

    t>0 and A \succeq 0,

    then

    z^T(tA)z = t ( z^TAz) \geq 0

    and so

    tA \succeq 0 .

    Next, observe: for x,y feasible and t \in [0,1] , the function

    F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n

    evaluated at the convex combination

    tx + (1-t)y

    is

    \begin{aligned}  F(tx +(1-t)y) &= F_0 + (tx_1 + (1-t)y_1) F_1 + \cdots + (tx_n + (1-t)y_n)F_n. \end{aligned}

    Expanding, rearranging and using

    F_0 = tF_0 + (1-t)F_0

    gives:

    \begin{aligned}  F(tx +(1-t)y) &= tF_0  + tx_1F_1 + \cdots tx_nF_n \\ &+ (1-t)F_0 + (1-t)y_1F_1 + \cdots  + (1-t)y_nF_n\\ &= tF(x) + (1-t)F(y). \end{aligned}

    Using F(x),F(y)\succeq 0 , together with the fact that nonnegative scalings and sums of positive semidefinite matrices are positive semidefinite ( z^T(A+B)z = z^TAz + z^TBz \geq 0 ), we conclude

    F(tx+(1-t)y) = tF(x) + (1-t)F(y) \succeq 0 .

    Thus, x,y feasible \implies tx + (1-t)y feasible for t \in [0,1], i.e., the feasible set is convex.
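The affinity computation above can be illustrated numerically. Here is a minimal pure-Python sketch (the 2×2 matrices F_0, F_1, F_2 and the points x, y are assumed sample data) checking F(tx+(1-t)y) = tF(x)+(1-t)F(y) and the resulting PSD-ness of the convex combination:

```python
# F(x) = F0 + x1*F1 + x2*F2 with 2x2 symmetric matrices (sample data).
F0 = [[2.0, 0.0], [0.0, 2.0]]
F1 = [[1.0, 0.5], [0.5, 0.0]]
F2 = [[0.0, 0.0], [0.0, 1.0]]

def F(x):
    # entrywise F0 + x[0]*F1 + x[1]*F2
    return [[F0[i][j] + x[0]*F1[i][j] + x[1]*F2[i][j] for j in range(2)]
            for i in range(2)]

def is_psd_2x2(M):
    # 2x2 symmetric M is PSD iff both diagonal entries and det are >= 0
    return M[0][0] >= 0 and M[1][1] >= 0 and M[0][0]*M[1][1] - M[0][1]*M[1][0] >= 0

x, y, t = [0.5, 0.2], [-0.3, 0.8], 0.4
lhs = F([t*x[0] + (1-t)*y[0], t*x[1] + (1-t)*y[1]])
rhs = [[t*F(x)[i][j] + (1-t)*F(y)[i][j] for j in range(2)] for i in range(2)]
# F is affine in x, so the two sides agree entrywise ...
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12 for i in range(2) for j in range(2))
# ... and the convex combination of feasible points stays feasible.
assert is_psd_2x2(F(x)) and is_psd_2x2(F(y)) and is_psd_2x2(lhs)
```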
    Example 1. LPs are SDPs Consider the (LP)

    \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax + b \succeq 0  \end{cases},

    where

    A \in \mathbb{R}^{m \times n}, \quad b \in \mathbb{R}^{m}, \quad c \in \mathbb{R}^n

    and

    Ax+b \succeq 0

    means componentwise inequality.

    Given v = \begin{bmatrix}v_1 & \cdots & v_m \end{bmatrix}^T \in \mathbb{R}^m , define

    \text{diag}(v) =  \begin{bmatrix} v_1 & 0 & \cdots & 0\\ 0 & v_2 &\cdots &0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v_m \end{bmatrix}.



    Since A \in \boldsymbol{S}^m satisfies A \succeq 0 iff A has nonnegative eigenvalues, we have

    v \succeq 0 \iff \text{diag}(v) \succeq 0.

    (Indeed, the eigenvalues of \text{diag}(v) are the components of v .) Therefore,

    \begin{aligned} Ax+b \succeq 0  &\iff \text{diag}(Ax+b) \succeq 0 \\ \text{(vector inequality)} & \iff \text{(matrix inequality)}. \end{aligned}



    Letting

    A=\begin{bmatrix}a_1 & \cdots & a_n \end{bmatrix}, \quad a_i \in \mathbb{R}^m ,

    we have

    Ax + b = b + x_1 a_1 + \cdots + x_n a_n .



    Therefore, using

    \begin{aligned} \text{diag}(v+\lambda w) &=  \begin{bmatrix}  v_1 + \lambda w_1 &0 & \cdots & 0\\ 0 & v_2 + \lambda w_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & v_m + \lambda w_m  \end{bmatrix}\\ &= \begin{bmatrix}  v_1  &0 & \cdots & 0\\ 0 & v_2  & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & v_m  \end{bmatrix} + \lambda \begin{bmatrix}  w_1 &0 & \cdots & 0\\ 0 &  w_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots &  w_m  \end{bmatrix}\\ &= \text{diag}(v) + \lambda \text{diag}(w) \end{aligned}

    we have

    \begin{aligned}  \text{diag}(Ax + b) &= \text{diag}(b + x_1 a_1 + \cdots + x_n a_n)\\ &= \text{diag}(b) + x_1 \text{diag}(a_1) + \cdots + x_n \text{diag}(a_n). \end{aligned}



    Therefore, defining

    \begin{aligned}  F_0 &= \text{diag}(b), \quad F_i = \text{diag}(a_i)\\ F(x) &= F_0 + x_1 F_1 + \cdots + x_n F_n = \text{diag}(Ax+b), \end{aligned}

    we conclude

    \begin{aligned} Ax+b \succeq 0 \iff \text{diag}(Ax+b) \succeq 0 \iff F(x) \succeq 0 \end{aligned} .



    In conclusion, we have that the (LP)

    \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & Ax + b \succeq 0  \end{cases},

    is equivalent to the (SDP)

    \begin{cases} \text{minimize} & c^Tx\\ \text{subject to} & F(x)= \text{diag}(Ax+b) \succeq 0  \end{cases}.
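The key observation — componentwise nonnegativity of Ax+b is the same as positive semidefiniteness of diag(Ax+b) — can be spot-checked in a few lines of Python (the matrix A and vector b below are assumed sample data):

```python
# Componentwise Ax + b >= 0 holds iff diag(Ax + b) is PSD: the
# eigenvalues of a diagonal matrix are its diagonal entries.
A = [[1.0, 0.0], [-1.0, 2.0], [0.0, 1.0]]   # sample 3x2 data
b = [1.0, 0.5, -0.25]

def affine(x):
    # the vector Ax + b
    return [sum(A[i][j]*x[j] for j in range(2)) + b[i] for i in range(3)]

def diag_psd(v):
    # diag(v) >= 0 iff every diagonal entry (= eigenvalue) is nonnegative
    return all(vi >= 0 for vi in v)

for x in ([1.0, 1.0], [0.0, 0.0], [-2.0, 0.1]):
    v = affine(x)
    componentwise = all(vi >= 0 for vi in v)
    assert componentwise == diag_psd(v)
```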

    Example 2. Nonlinear OP as an SDP Consider the nonlinear (COP)

    \text{(OP1)} \begin{cases} \text{minimize}& \frac{(c^Tx)^2}{d^Tx}\\ \text{subject to} & Ax + b  \succeq 0\\ & d^Tx > 0 \end{cases}

    where

    \begin{aligned} &c,d \in \mathbb{R}^n\\ &A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m. \end{aligned}

    We will recast (OP1) as an (SDP).
    N.B.:
    1. The constraint d^Tx > 0 fixes the domain of the objective (an open halfspace) and ensures the objective is convex there.
    2. On the halfspace d^Tx < 0 , the objective would instead be concave.
    To begin, recall we may recast the problem in epigraph form:

    \text{(OP1)} \begin{cases} \text{minimize}& t\\ \text{subject to} & Ax + b  \succeq 0\\ & d^Tx > 0\\ & \frac{(c^Tx)^2}{d^Tx} \leq t \end{cases}.

    N.B.: this introduces the new optimization variable t . Goal: find a symmetric matrix-valued function

    F(x,t) = F_0 + x_1F_1 + \cdots + x_nF_n + t F_{n+1}

    such that the constraints

    \begin{cases} &Ax + b  \succeq 0\\ & d^Tx > 0\\ & \frac{(c^Tx)^2}{d^Tx} \leq t \end{cases}

    may be recast as the LMI

    \begin{cases} F(x,t) \succeq 0 \end{cases}.

    Idea: we know

    Ax + b  \succeq 0 \iff \text{diag}(Ax+b) \succeq 0 .

    On the other hand:

    \begin{aligned} \frac{(c^Tx)^2}{d^Tx} \leq t &\iff (c^Tx)^2 \leq td^Tx \\ &\iff td^Tx - (c^Tx)^2 \geq0  \end{aligned} .

    Recall: if \gamma>0 , then

    \begin{bmatrix}\alpha&\beta\\\beta&\gamma\end{bmatrix}\succeq0 \iff \alpha\geq0, \alpha\gamma - \beta^2 \geq 0

    Therefore, given d^Tx>0, we have

    \begin{aligned} \frac{(c^Tx)^2}{d^Tx} \leq t & \iff \begin{bmatrix}t & c^Tx\\ c^Tx & d^Tx \end{bmatrix} \succeq 0 \end{aligned} .
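This 2×2 equivalence is easy to spot-check numerically. Below is a small sketch (the sample values of c^Tx, d^Tx and t are assumptions for illustration):

```python
# With d^T x > 0: (c^T x)^2 / (d^T x) <= t iff [[t, c^Tx], [c^Tx, d^Tx]] is PSD.
def psd_2x2(a, bb, g):
    # [[a, bb], [bb, g]] is PSD iff a >= 0, g >= 0 and a*g - bb^2 >= 0
    return a >= 0 and g >= 0 and a*g - bb*bb >= 0

# sample (c^T x, d^T x, t) triples, including boundary cases
for ctx, dtx, t in [(1.0, 2.0, 0.5), (1.0, 2.0, 0.4), (3.0, 1.0, 9.0), (3.0, 1.0, 8.9)]:
    lhs = ctx**2 / dtx <= t
    rhs = psd_2x2(t, ctx, dtx)
    assert lhs == rhs
```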

    Using that

    \begin{bmatrix} A &\vline &0\\ \hline 0 & \vline& B \end{bmatrix} \succeq 0 \iff A \succeq 0 , B \succeq 0,

    we therefore introduce

    E =  \begin{bmatrix} \text{diag}(Ax+b) & 0 & 0\\ 0 & t  & c^Tx\\ 0 & c^Tx & d^Tx \end{bmatrix}

    to capture the problem's constraints.
    Indeed, evidently, E \succeq 0 iff

    \text{diag}(Ax+b) \succeq 0 and \begin{bmatrix} t & c^Tx\\c^Tx & d^Tx \end{bmatrix} \succeq 0 .

    Therefore,

    \begin{array}{l}  Ax + b  \succeq 0\\  d^Tx > 0\\  \frac{(c^Tx)^2}{d^Tx} \leq t \end{array} \iff \begin{array}{l} \begin{bmatrix} \text{diag}(Ax+b) & 0 & 0\\ 0 & t  & c^Tx\\ 0 & c^Tx & d^Tx \end{bmatrix} \succeq 0 \end{array} .

    This is enough to conclude (OP1) may be recast as an (SDP).

    To make it clearer, introduce the notation

    A = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}, \quad a_i \in \mathbb{R}^m

    and (m+2)\times(m+2) matrices

    \begin{aligned} F_0 &= \text{diag}(b,0,0)  \\ F_i & =  \begin{bmatrix} \text{diag}(a_i) &0 &0\\ 0&0&c_i\\ 0&c_i&d_i \end{bmatrix}\\ F_{n+1} & = \begin{bmatrix}0_{m \times m} &0 &0\\0 & 1 & 0\\0 & 0 & 0\end{bmatrix}\\ \end{aligned}.

    Then

    \begin{aligned} \begin{bmatrix} \text{diag}(Ax+b) & 0 & 0\\ 0 & t  & c^Tx\\ 0 & c^Tx & d^Tx \end{bmatrix} = F_0 + x_1F_1 + \cdots + x_nF_n + t F_{n+1}:=F(x,t) \end{aligned}

    and so (OP1) is equivalent to the (SDP)

    \begin{cases} \text{minimize}& t\\ \text{subject to} & F(x,t) \succeq 0. \end{cases}
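The assembly of F(x,t) from the pieces F_0, F_i, F_{n+1} can be verified directly. Here is a pure-Python sketch (the data A, b, c, d below are assumed sample values) building F(x,t) term by term and comparing it with the block matrix above:

```python
# Build F(x,t) = F0 + x1*F1 + ... + xn*Fn + t*F_{n+1} and compare it
# against the block matrix [[diag(Ax+b), 0, 0], [0, t, c^Tx], [0, c^Tx, d^Tx]].
m, n = 2, 2
A = [[1.0, 2.0], [0.0, -1.0]]   # sample data
b = [0.5, 1.0]
c = [1.0, -1.0]
d = [2.0, 0.5]

def zeros(k):
    return [[0.0]*k for _ in range(k)]

def F_i(i):
    # the (m+2) x (m+2) piece attached to variable x_i
    M = zeros(m + 2)
    for r in range(m):
        M[r][r] = A[r][i]            # diag(a_i) block
    M[m][m + 1] = M[m + 1][m] = c[i] # off-diagonal c_i entries
    M[m + 1][m + 1] = d[i]           # d_i entry
    return M

def F(x, t):
    M = zeros(m + 2)
    for r in range(m):
        M[r][r] = b[r]               # F0 = diag(b, 0, 0)
    M[m][m] = t                      # t * F_{n+1}
    for i in range(n):
        Fi = F_i(i)
        for r in range(m + 2):
            for s in range(m + 2):
                M[r][s] += x[i]*Fi[r][s]
    return M

x, t = [1.0, 0.5], 3.0
M = F(x, t)
Ax_b = [sum(A[r][j]*x[j] for j in range(n)) + b[r] for r in range(m)]
cTx = sum(c[j]*x[j] for j in range(n))
dTx = sum(d[j]*x[j] for j in range(n))
assert [M[r][r] for r in range(m)] == Ax_b     # diag(Ax+b) block
assert M[m][m] == t and M[m][m+1] == cTx and M[m+1][m+1] == dTx
```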

    Lagrangian Duality

    Throughout, let (OP) denote a given optimization problem of the form

    \text{(OP)} \begin{cases} \text{minimize} & f_0(x)\\ \text{subject to} & f_i(x) \leq 0 , \quad i=1,\ldots,m\\ & h_i(x) = 0, \quad i = 1,\ldots,p \end{cases}.

    Recall:

    \begin{aligned} \text{Problem domain: }& D := \bigcap_{i=0}^m \text{dom}\, f_i \cap \bigcap_{i=1}^p \text{dom}\,h_i\\ \text{Optimal value: } & p^\star := \inf\{f_0(x): x \in D , x \text{ feasible}\}. \end{aligned}

    The Lagrange Dual Lagrangian: the function

    \begin{aligned}  L&:\mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R} \\ \text{dom}\,L &= D \times \mathbb{R}^m \times \mathbb{R}^p \end{aligned}

    given by

    \begin{aligned} L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \end{aligned} .

    N.B.: for each fixed x \in D , the function

    (\lambda,\nu) \mapsto L(x,\lambda,\nu)

    is affine.

    Lagrange multipliers: the variables \lambda_i and \nu_i .
    The vectors

    \begin{aligned} \lambda :=  \begin{bmatrix}\lambda_1\\\lambda_2\\\vdots\\\lambda_m\end{bmatrix}, \quad  \nu := \begin{bmatrix}\nu_1\\\nu_2\\\vdots\\\nu_p\end{bmatrix} \end{aligned}

    are called dual variables.

    Lagrange dual function: the function

    \begin{aligned} g&:\mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}\\ \end{aligned}

    given by

    g(\lambda,\nu) = \inf\{L(x,\lambda,\nu): x \in D\} .

    Those (\lambda,\nu) \in \mathbb{R}^m_+ \times \mathbb{R}^p satisfying

    g(\lambda,\nu) > -\infty

    are called dual feasible.
    N.B.: as an infimum of affine functions, g is automatically concave.

    Proposition. For

    \lambda \in \mathbb{R}^m_+, \quad \nu \in \mathbb{R}^p ,

    there holds

    g(\lambda,\nu) \leq p^\star,

    where p^\star is the optimal value for the given (OP).
    Proof.
    1. Let x \in D be feasible.
      Then

      \begin{aligned} f_i(x) &\leq 0, \quad i=1,\ldots,m\\ h_i(x) &=0, \quad i=1,\ldots,p. \end{aligned}



    2. Let \lambda \succeq 0 and \nu be arbitrary.
      Then feasibility of x implies

      \begin{aligned} \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) = \sum_{i=1}^m \lambda_i f_i(x) \leq 0. \end{aligned}

      Consequently,

      \begin{aligned}  L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \leq f_0(x). \end{aligned}



    3. Therefore, for all feasible x , for \lambda \succeq 0 and for arbitrary \nu , there holds

      \begin{aligned} g(\lambda,\nu) = \inf\{L(z,\lambda,\nu):z \in D \} \leq L(x,\lambda,\nu) \leq f_0(x) \end{aligned}

      and so

      g(\lambda,\nu) \leq p^\star .

      (Indeed, g(\lambda,\nu) is a lower bound of f_0(x) and p^\star is the greatest lower bound of f_0(x) .)
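The lower-bound property is easy to watch on a toy instance. The following sketch (the one-dimensional problem is an assumed example, not from the notes) takes minimize x^2 subject to 1 - x \leq 0, whose optimal value is p^\star = 1 at x = 1; here L(x,\lambda) = x^2 + \lambda(1-x) is minimized at x = \lambda/2, giving g(\lambda) = \lambda - \lambda^2/4:

```python
# Weak duality on the toy problem: minimize x^2 s.t. 1 - x <= 0, p* = 1.
p_star = 1.0

def g(lam):
    # inf over x of x^2 + lam*(1 - x), attained at x = lam/2
    return lam - lam**2 / 4.0

# g(lam) <= p* for every lam >= 0 ...
for lam in [0.0, 0.5, 1.0, 2.0, 3.0, 10.0]:
    assert g(lam) <= p_star + 1e-12

# ... and the bound is tight here: g(2) equals p*.
assert abs(g(2.0) - p_star) < 1e-12
```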


    Lagrangian as underestimator.
    (See CO 5.1.4)
    Define the indicator functions

    I_-(t)= \begin{cases} 0 &t \leq 0\\ +\infty & t> 0 \end{cases},\qquad I_0(t)= \begin{cases} 0 & t=0\\ +\infty & t\neq0 \end{cases}.

    Then

    \begin{aligned} I_-(f_i(x)) &\text{ indicates when the constraint }f_i(x) \leq 0 \text{ is violated}\\ I_0(h_i(x)) &\text{ indicates when the constraint }h_i(x) = 0 \text{ is violated} \end{aligned}

    and the (OP) is equivalent to

    \begin{aligned} \begin{cases} \text{ minimize } f_0(x) + \sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x)). \end{cases} \end{aligned}

    N.B.: the terms in

    \begin{aligned}\sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x)) \end{aligned}

    act as penalties for breaking the desired constraints.

    N.B.: for any x \in D (feasible or not) and (\lambda,\nu) \in \mathbb{R}_+^m\times\mathbb{R}^p , we have

    \begin{aligned} \lambda_i f_i(x) &\leq I_-(f_i(x))\\ \nu_i h_i(x) &\leq I_0(h_i(x)) \end{aligned}

    and hence

    \begin{aligned}L(x,\lambda,\nu) & = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \\ &\leq f_0(x) + \sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x))\end{aligned} .

    Viz., L is an underestimator of the objective function

    f_0(x) + \sum_{i=1}^mI_-(f_i(x)) + \sum_{i=1}^p I_0(h_i(x))

    obtained by “softening” or “weakening” the penalty functions:

    \begin{aligned} I_-(t) & \to ct, \quad c\geq 0\\ I_0(t) & \to bt, \quad b \in \mathbb{R}. \end{aligned}



    In particular, for each (\lambda,\nu) \in \mathbb{R}_+^m \times \mathbb{R}^p , the problem

    \begin{cases} \text{minimize } L(x,\lambda,\nu) \end{cases}

    has optimal value g(\lambda,\nu) , which underestimates the optimal value of the original (OP).
    Example 1 Consider the least squares problem

    (LS) \begin{cases} \text{minimize} & x^Tx\\ \text{subject to}& Ax=b \end{cases}

    for given

    A = \begin{bmatrix} a_1^T \\ \vdots \\ a_p^T \end{bmatrix} \in \mathbb{R}^{p\times n}, \quad a_i \in \mathbb{R}^n, \quad b \in \mathbb{R}^p .

    N.B.:

    \begin{aligned} Ax = b \iff h_i(x) = a_i^Tx - b_i = 0 \text{ for } i=1,\ldots,p \end{aligned}.



    Therefore, the Lagrangian L for (LS) is

    \begin{aligned} L&:\mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\,L &= \mathbb{R}^n \times \mathbb{R}^p\\ L(x,\nu) &= x^Tx + \sum_{i=1}^p \nu_i (a_i^Tx - b_i) \\ &= x^Tx + \nu^T(Ax-b) \end{aligned}

    and the Lagrange dual is

    \begin{aligned} g&:\mathbb{R}^p \to \mathbb{R}\\ g(\nu) &= \inf\{ x^Tx + \nu^T(Ax-b) : x \in \mathbb{R}^n \}. \end{aligned}



    N.B.:

    \begin{aligned} \nabla^2_x L(x,\nu) &= \nabla_x^2 ( x^Tx + \nu^T(Ax-b)) \\ &= 2Id_{n \times n}\\ & \succeq 0  \end{aligned}

    and so x\mapsto L(x,\nu) is convex.
    Consequently,

    \begin{aligned} L(x^\star,\nu) = \inf\{L(x,\nu):x \in \mathbb{R}^n \} = \text{min}\{L(x,\nu):x \in \mathbb{R}^n \} \end{aligned}

    iff

    \nabla_x L(x^\star,\nu) = 2x^\star + A^T\nu = 0 ,

    i.e., iff

    x^\star = -\frac{1}{2}A^T\nu.



    In conclusion,

    \begin{aligned} g(\nu) &= L(x^\star,\nu)\\ &= (x^\star)^Tx^\star + \nu^T(Ax^\star-b)\\ &= \left(-\frac{1}{2}A^T\nu\right)^T\left(-\frac{1}{2}A^T\nu\right) +\nu^T\left(A\left(-\frac{1}{2}A^T\nu\right) - b\right)\\ &=\frac{1}{4}\nu^TAA^T\nu - \frac{1}{2}\nu^TAA^T\nu - \nu^Tb\\ &=-\frac{1}{4}\nu^TAA^T\nu - \nu^Tb. \end{aligned}

    In particular,

    -\frac{1}{4}\nu^TAA^T\nu - b^T\nu \leq \inf\{x^Tx:Ax=b\}

    for all \nu \in \mathbb{R}^p.
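The dual bound can be checked on a tiny instance. The data below are assumed for illustration: A = [[1, 1]], b = [1], i.e. minimize x^Tx subject to x_1 + x_2 = 1, with optimum x = (1/2, 1/2) and p^\star = 1/2; here AA^T = [[2]], so the formula gives g(\nu) = -\nu^2/2 - \nu:

```python
# Least-squares dual bound: g(nu) = -(1/4) nu^T A A^T nu - b^T nu
# for A = [[1, 1]], b = [1]; the primal optimum is p* = 1/2.
p_star = 0.5

def g(nu):
    return -0.25 * nu * 2.0 * nu - nu   # A A^T = [[2]]

# g(nu) <= p* for every nu ...
for nu in [-3.0, -1.0, 0.0, 1.0, 2.5]:
    assert g(nu) <= p_star + 1e-12

# ... and maximizing g recovers p*: g(-1) = 1/2.
assert abs(g(-1.0) - p_star) < 1e-12
```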
    Example 2 Consider the linear program

    \text{(LP)} \begin{cases} \text{minimize} & c^Tx\\ \text{subject to}& Ax=b\\ & x\succeq 0 \end{cases}

    for given

    c \in \mathbb{R}^n, \quad A \in \mathbb{R}^{p \times n}, \quad b \in \mathbb{R}^p .

    N.B.:
    • equality constraints given by

      h_i(x) = a_i^Tx-b_i =0, \quad i=1,\ldots,p.

    • x\succeq 0 iff

      x_i \geq 0, \quad i=1,\ldots,n\quad

      iff

      \quad f_i(x) = -x_i \leq 0,\quad i=1,\ldots,n.



    Therefore, the Lagrangian for (LP) is

    \begin{aligned} L&: \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\, L &=\mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \\ L(x,\lambda,\nu) &= c^Tx - \sum_{i=1}^n \lambda_i x_i + \sum_{i=1}^p \nu_i(a_i^Tx-b_i)\\ &= c^Tx - \lambda^Tx + \nu^T(Ax-b)\\ &=(c - \lambda + A^T\nu)^Tx - \nu^Tb \end{aligned}



    Want to compute

    g(\lambda,\nu) = \inf\{ L(x,\lambda,\nu) : x \in \mathbb{R}^n \},

    but

    x \mapsto (c - \lambda + A^T\nu)^Tx - \nu^Tb

    is an affine function with domain \mathbb{R}^n .
    Therefore,

    x \mapsto (c - \lambda + A^T\nu)^Tx - \nu^Tb

    is bounded below iff (\lambda,\nu) satisfy

    c - \lambda + A^T\nu = 0 .



    Therefore, the Lagrange dual is

    \begin{aligned} g&:\mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\, g &= \{(\lambda,\nu) \in \mathbb{R}^n \times \mathbb{R}^p : c - \lambda + A^T\nu = 0\}\\ g(\lambda,\nu) &= \begin{cases} -b^T\nu & c - \lambda + A^T\nu = 0\\ -\infty & \text{else} \end{cases}. \end{aligned}



    In particular, for dual feasible (\lambda,\nu), there holds

    -b^T\nu \leq c^Tx

    for all feasible x .
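A tiny numerical instance makes the bound concrete. The data below are an assumed example: minimize c^Tx with c = (1, 2) subject to x_1 + x_2 = 1, x \succeq 0, so p^\star = 1 at x = (1, 0); dual feasibility requires \lambda = c + A^T\nu \succeq 0, i.e. \nu \geq -1:

```python
# LP weak duality: -b^T nu <= c^T x for dual feasible (lambda, nu)
# and primal feasible x, on the instance c=(1,2), A=[[1,1]], b=[1].
c = [1.0, 2.0]

def primal(x1):
    x = [x1, 1.0 - x1]                 # Ax = b holds by construction
    assert x[0] >= 0 and x[1] >= 0     # primal feasibility
    return c[0]*x[0] + c[1]*x[1]

for nu in [-1.0, -0.5, 0.0, 2.0]:      # each gives lambda = c + nu >= 0
    lam = [c[0] + nu, c[1] + nu]
    assert lam[0] >= 0 and lam[1] >= 0 # dual feasibility
    for x1 in [0.0, 0.25, 0.5, 1.0]:
        assert -nu <= primal(x1) + 1e-12
```

At \nu = -1 the bound -b^T\nu = 1 matches p^\star = 1, i.e. the dual bound is tight for this instance.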
    Return of Conjugate Function Recall: given f:\mathbb{R}^n \to \mathbb{R}, its conjugate function f^* is the convex function

    f^*(y) = \sup\{y^Tx - f(x) : x \in \text{dom}\,f\}.

    Interestingly: the conjugate function is related to the Lagrange dual. Example.
    Consider the (OP)

    \text{(OP)} \begin{cases} \text{minimize}&f_0(x)\\ \text{subject to}&Ax \preceq b\\ &Cx=d \end{cases},

    for given

    \begin{aligned} f_0:\mathbb{R}^n \to \mathbb{R},\quad A \in \mathbb{R}^{m \times n}, \quad b \in \mathbb{R}^m\\ C \in \mathbb{R}^{p \times n}, \quad d \in \mathbb{R}^p. \end{aligned}

    We may write

    \begin{aligned} Ax\preceq b &\iff f_i(x) = a_i^Tx - b_i \leq0 , \quad i =1,\ldots, m\\ Cx = d &\iff h_i(x) =c_i^Tx - d_i=0, \quad i=1,\ldots,p \end{aligned}

    for suitable

    a_i ,c_i \in \mathbb{R}^n.



    The Lagrangian is

    \begin{aligned} L&:\mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}\\ \text{dom}\, L &= D \times \mathbb{R}^m \times \mathbb{R}^p\\ L(x,\lambda,\nu) &= f_0(x) + \sum_{i=1}^m \lambda_i(a_i^Tx - b_i) + \sum_{i=1}^p \nu_i(c_i^Tx - d_i)\\ &= f_0(x) + \lambda^T(Ax-b) + \nu^T(Cx-d)\\ &= f_0(x) + \lambda^TAx + \nu^TCx - \lambda^Tb - \nu^Td \end{aligned}



    We may now compute the Lagrange dual in terms of the conjugate f_0^* :

    \begin{aligned} g&:\mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}\\ g(\lambda,\nu)&= \inf\{L(x,\lambda,\nu):x \in D\}\\ &=\inf\{f_0(x) + \lambda^TAx + \nu^TCx - \lambda^Tb - \nu^Td : x \in D\}\\ &=\inf\{ (A^T\lambda + C^T\nu)^Tx + f_0(x):x \in D\} - \lambda^Tb - \nu^Td \\ &=-\sup\{-(A^T\lambda+C^T\nu)^Tx - f_0(x) : x \in D\} - \lambda^Tb - \nu^Td\\ &=-f_0^*(-A^T\lambda-C^T\nu) - \lambda^Tb - \nu^Td. \end{aligned}

    Since

    \begin{aligned} g(\lambda,\nu)>-\infty \iff f_0^*(-A^T\lambda-C^T\nu)< +\infty \end{aligned},

    we conclude

    \begin{aligned}\text{dom}\,g = \{(\lambda,\nu)\in \mathbb{R}^m\times\mathbb{R}^p: -A^T\lambda-C^T\nu \in \text{dom}\,f_0^* \} \end{aligned}.
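A one-dimensional check of the conjugate formula (the setup below is an assumed example): f_0(x) = x^2 has conjugate f_0^*(y) = y^2/4, so for "minimize x^2 subject to x - b_0 \leq 0" the formula g(\lambda) = -f_0^*(-\lambda) - \lambda b_0 gives -\lambda^2/4 - \lambda b_0, which we compare against a brute-force grid minimization of the Lagrangian:

```python
# Dual via conjugate: g(lam) = -f0*(-A^T lam) - lam*b0 for f0(x) = x^2,
# A = [1], b = [b0], compared to direct minimization of L(x, lam).
b0 = 2.0   # assumed sample value

def f0_conj(y):
    # conjugate of x^2 is y^2/4
    return y*y / 4.0

def g_formula(lam):
    return -f0_conj(-lam) - lam*b0

def g_grid(lam):
    # brute-force inf of x^2 + lam*(x - b0) over a grid on [-10, 10]
    xs = [i/100.0 for i in range(-1000, 1001)]
    return min(x*x + lam*(x - b0) for x in xs)

for lam in [0.0, 0.5, 1.0, 4.0]:
    assert abs(g_formula(lam) - g_grid(lam)) < 1e-3
```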

    Appendix
    Differentiating b^Tx Given

    b = \begin{bmatrix}b_1\\b_2\\\vdots\\b_n\end{bmatrix} \in \mathbb{R}^n ,

    define the scalar function

    g(x) = b^Tx = b_1x_1 + \cdots + b_n x_n .

    Using the Taylor expansion

    \begin{aligned} g(y) = g(x) + (y-x)^T \nabla g(x) + \cdots \end{aligned}

    at y = 0, we find

    \begin{aligned} 0 = b^Tx - x^T \nabla g(x) + \cdots \end{aligned}

    and so

    x^T \nabla g(x) = b^Tx \quad \text{for all } x, \quad \text{i.e., } \nabla g(x) = b.

    To see this computed directly, computing

    \begin{aligned}  \frac{\partial}{\partial x_k} g(x) &= \frac{\partial}{\partial x_k}(b_1x_1 + \cdots + b_nx_n)\\ &=b_k, \end{aligned}

    we conclude

    \nabla g = \begin{bmatrix}\frac{\partial}{\partial x_1} g(x) \\ \frac{\partial}{\partial x_2}g(x) \\ \vdots \\ \frac{\partial}{\partial x_n}g(x) \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = b.
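A finite-difference check confirms the computation (the vector b and test point x are assumed sample data):

```python
# Central-difference check that the gradient of g(x) = b^T x is b.
b = [1.0, -2.0, 3.5]   # sample data

def g(x):
    return sum(bi*xi for bi, xi in zip(b, x))

def fd_grad(f, x, h=1e-6):
    # central differences in each coordinate
    grad = []
    for k in range(len(x)):
        xp = list(x); xp[k] += h
        xm = list(x); xm[k] -= h
        grad.append((f(xp) - f(xm)) / (2*h))
    return grad

x = [0.3, -1.2, 4.0]
for gk, bk in zip(fd_grad(g, x), b):
    assert abs(gk - bk) < 1e-6
```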

    Differentiating \frac{1}{2}x^TQx Given

    Q  = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1n}\\ q_{21} & q_{22} & \cdots & q_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ q_{n1} & q_{n2} & \cdots & q_{nn} \end{bmatrix} \in \boldsymbol{S}^n

    define the scalar function

    f(x) = \frac{1}{2}x^T Q x.

    Using the Taylor expansion

    \begin{aligned} f(y) = f(x) + (y-x)^T \nabla f(x) + \frac{1}{2} (y-x)^T \nabla^2 f(x) (y-x) + \cdots \end{aligned}

    at y = 0 , we find

    \begin{aligned} 0 &= \frac{1}{2}x^TQx - x^T \nabla f(x) + \frac{1}{2} x^T \nabla^2 f(x)x + \cdots\\ &= x^TQx - x^T \nabla f(x) - \frac{1}{2}x^TQx + \frac{1}{2}x^T\nabla^2 f(x) x + \cdots, \end{aligned}

    which is evidently satisfied in case

    \begin{aligned} \nabla f(x) &= Qx\\ \nabla^2 f(x) &= Q. \end{aligned}

    To see this computed directly: expanding the matrix multiplication, we have

    \begin{aligned} f(x) = \frac{1}{2} \sum_{i,j=1}^{n} x_i x_j q_{ij}  \end{aligned}  .

    Computing

    \begin{aligned}  \frac{\partial}{\partial x_k} f(x) &= \frac{1}{2}\sum_{i,j=1}^n \frac{\partial}{\partial x_k} (x_i x_j q_{ij})\\ &=\frac{1}{2} \left(\sum_{i=1}^n x_i q_{ik} + \sum_{j=1}^n x_j q_{kj} \right)\\ &=\frac{1}{2}\sum_{i=1}^n x_i(q_{ik}+q_{ki})\\ &=\sum_{i=1}^n x_iq_{ik} \end{aligned}

    we have

    \begin{aligned} \nabla f(x) &=  \begin{bmatrix} \frac{\partial}{\partial x_1} f(x)\\ \frac{\partial}{\partial x_2} f(x)\\ \vdots\\ \frac{\partial}{\partial x_n} f(x) \end{bmatrix} =Qx \end{aligned} .

    Taking the second derivative gives

    \begin{aligned}  \frac{\partial^2}{\partial x_k \partial x_l} f(x) &= \sum_{i=1}^n\frac{\partial}{\partial x_l}x_i q_{ik}\\ &=q_{lk}. \end{aligned}

    Consequently, there holds

    \begin{aligned} \nabla^2 f(x) &=  \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1 }(x) & \frac{\partial^2 f}{\partial x_2^2 }(x) & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n \partial x_1}(x) & \frac{\partial^2 f}{\partial x_n \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x) \\ \end{bmatrix}\\ & = Q. \end{aligned}
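Both identities can be verified by differencing (the symmetric Q and the test point are assumed sample data; for a quadratic, central first differences and unit second differences are exact up to rounding):

```python
# Check that for f(x) = (1/2) x^T Q x with symmetric Q:
# grad f(x) = Q x and the Hessian is Q.
Q = [[2.0, 1.0], [1.0, 3.0]]   # sample symmetric matrix

def f(x):
    return 0.5*sum(x[i]*Q[i][j]*x[j] for i in range(2) for j in range(2))

def fd_grad(fun, x, h=1e-5):
    out = []
    for k in range(len(x)):
        xp = list(x); xp[k] += h
        xm = list(x); xm[k] -= h
        out.append((fun(xp) - fun(xm)) / (2*h))
    return out

x = [1.0, -2.0]
Qx = [sum(Q[i][j]*x[j] for j in range(2)) for i in range(2)]
for approx, exact in zip(fd_grad(f, x), Qx):
    assert abs(approx - exact) < 1e-6

# Unit second differences recover the diagonal of the Hessian exactly.
for k in range(2):
    xp = list(x); xp[k] += 1.0
    xm = list(x); xm[k] -= 1.0
    assert abs((f(xp) - 2*f(x) + f(xm)) - Q[k][k]) < 1e-9
```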

    Example. Let

    \begin{aligned} f(x_1,x_2) &= \frac{1}{2}\begin{bmatrix}x_1 & x_2 \end{bmatrix} \begin{bmatrix} a&b\\c&d \end{bmatrix} \begin{bmatrix}x_1\\x_2 \end{bmatrix}\\ &=\frac{1}{2} \begin{bmatrix}x_1&x_2\end{bmatrix}\begin{bmatrix}ax_1+bx_2\\cx_1+dx_2\end{bmatrix}\\ &=\frac{1}{2}(ax_1^2 + (b+c)x_1x_2 + dx_2^2). \end{aligned}

    Compute

    \begin{aligned} \frac{\partial}{\partial x_1} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_1} (ax_1^2 + (b+c)x_1x_2 + dx_2^2)\\ &=\frac{1}{2}(2ax_1 + (b+c)x_2)\\ \frac{\partial}{\partial x_2} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_2} (ax_1^2 + (b+c)x_1x_2 + dx_2^2)\\ &=\frac{1}{2}((b+c)x_1 + 2dx_2)\\ \end{aligned}.

    Consequently,

    \begin{aligned} \nabla f(x) &= \begin{bmatrix} \frac{\partial}{\partial x_1} f(x) \\ \frac{\partial}{\partial x_2} f(x) \end{bmatrix}\\ &=\begin{bmatrix} \frac{1}{2}(2ax_1 + (b+c)x_2)\\ \frac{1}{2}((b+c)x_1 + 2dx_2) \end{bmatrix}\\ &=\frac{1}{2} \left(\begin{bmatrix}ax_1 + bx_2 \\ cx_1 + dx_2 \end{bmatrix} + \begin{bmatrix} ax_1 + cx_2 \\ bx_1 + dx_2 \end{bmatrix} \right)\\ &=\frac{1}{2}\left( \begin{bmatrix}a&b\\c&d\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix} + \begin{bmatrix}a&c\\b&d\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix}\right)\\ &=\frac{1}{2}(Q+Q^T)x. \end{aligned}

    Next compute the second derivatives:

    \begin{aligned} \frac{\partial^2}{\partial x_1^2} f(x_1,x_2) &=\frac{1}{2}\frac{\partial}{\partial x_1}(2ax_1 + (b+c)x_2)\\ &=a\\ \frac{\partial^2}{\partial x_1 \partial x_2} f(x_1,x_2) &=\frac{1}{2}\frac{\partial}{\partial x_2}(2ax_1 + (b+c)x_2)\\ &=\frac{1}{2}(b+c)\\ \frac{\partial^2}{\partial x_2 \partial x_1} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_1}((b+c)x_1 + 2dx_2)\\ &=\frac{1}{2}(b+c)\\ \frac{\partial^2}{\partial x_2^2} f(x_1,x_2) &= \frac{1}{2}\frac{\partial}{\partial x_2}((b+c)x_1 + 2dx_2)\\ &=d. \end{aligned}

    Putting this together gives

    \begin{aligned} \nabla^2 f(x) &= \begin{bmatrix} \frac{\partial^2}{\partial x_1^2}f & \frac{\partial^2}{\partial x_1 \partial x_2}f\\ \frac{\partial^2}{\partial x_2 \partial x_1}f & \frac{\partial^2}{\partial x_2^2}f \end{bmatrix}\\ &= \begin{bmatrix} a & \frac{1}{2}(b+c)\\ \frac{1}{2}(c+b) & d \end{bmatrix}\\ &= \frac{1}{2}\begin{bmatrix} a & b\\ c& d \end{bmatrix} + \frac{1}{2}\begin{bmatrix}a&c\\b&d\end{bmatrix}\\ &= \frac{1}{2}(Q+Q^T). \end{aligned}

    Differentiating b^TAx Given A \in \mathbb{R}^{m \times n} , b \in \mathbb{R}^m , define the scalar function h(x) = b^TAx.
    Using the Taylor expansion

    h(y) = h(x) + (y-x)^T\nabla h(x) + \cdots

    at y = 0, we get

    \begin{aligned} 0 &= b^TAx - x^T \nabla h(x) + \cdots\\ &= x^T(A^Tb) - x^T \nabla h(x) + \cdots \end{aligned}

    which directly implies

    \nabla h(x) = A^Tb .

    To see this computed directly, first compute

    \begin{aligned} b^TAx &= \begin{bmatrix}b_1& \cdots & b_m \end{bmatrix} \begin{bmatrix} a_1^T\\ \vdots \\ a_m^T \end{bmatrix} \begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}\\ &=\begin{bmatrix}b_1& \cdots & b_m \end{bmatrix} \begin{bmatrix} a_1^Tx\\\vdots\\ a_m^T x\end{bmatrix}\\ &= b_1a_1^Tx + \cdots + b_ma_m^Tx. \end{aligned}

    Computing

    \begin{aligned} \frac{\partial}{\partial x_k} b^TAx &= \frac{\partial}{\partial x_k}(b_1a_1^Tx + \cdots + b_ma_m^Tx)\\ &= b_1 a_{1k} + \cdots + b_m a_{mk}, \end{aligned}

    we conclude

    \begin{aligned} \nabla b^TAx &= \begin{bmatrix}  b_1 a_{11} + \cdots + b_m a_{m1}\\ \vdots\\ b_1 a_{1n} + \cdots + b_m a_{mn} \end{bmatrix}\\ &= A^Tb. \end{aligned}
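A finite-difference check of this gradient (the matrix A, vector b, and test point are assumed sample data; note that A^Tb is a vector in \mathbb{R}^n, the correct shape for the gradient):

```python
# Check that the gradient of h(x) = b^T A x is A^T b.
A = [[1.0, 2.0, 0.0], [-1.0, 0.5, 3.0]]   # sample 2x3 data
b = [2.0, -1.0]

def h(x):
    Ax = [sum(A[i][j]*x[j] for j in range(3)) for i in range(2)]
    return sum(b[i]*Ax[i] for i in range(2))

# the claimed gradient A^T b
ATb = [sum(A[i][j]*b[i] for i in range(2)) for j in range(3)]

eps = 1e-6
x = [0.1, 0.2, 0.3]
for k in range(3):
    xp = list(x); xp[k] += eps
    xm = list(x); xm[k] -= eps
    fd = (h(xp) - h(xm)) / (2*eps)   # central difference in coordinate k
    assert abs(fd - ATb[k]) < 1e-6
```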