MM Algorithm [unfinished]


1. Overview of MM algorithm


    • The MM algorithm is not an algorithm, but a prescription for constructing optimization algorithms. 
    • The EM algorithm from statistics is a special case.
    • An MM algorithm operates by creating a surrogate function that minorizes or majorizes the objective function. When the surrogate function is optimized, the objective function is driven uphill or downhill as needed.
    • In minimization MM stands for majorize/minimize, and in maximization MM stands for minorize/maximize.

2. Rationale for the MM principle


    • It can generate an algorithm that avoids matrix inversion. 
    • It can separate the parameters of a problem. 
    • It can linearize an optimization problem. 
    • It can deal gracefully with equality and inequality constraints. 
    • It can turn a non-differentiable problem into a smooth problem.

3. Majorization and definition of the algorithm


    • A function $g(θ | θ^n)$ is said to majorize the function $f(θ)$ at $θ^n$ provided: 

 $f(θ^n) = g(θ^n | θ^n); $

$f(θ) ≤ g(θ | θ^n)$ for all θ. 
    • The majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function.
    • A function $g(θ | θ^n)$ is said to minorize the function $f(θ)$ at $θ^n$ provided $−g(θ | θ^n)$ majorizes $−f(θ)$. 
    • In minimization, we choose a majorizing function $g(θ | θ^n)$ and minimize it. This produces the next point $θ^{n+1}$ in the algorithm.
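
As a concrete illustration (my own sketch, not from the original notes), the Python snippet below minimizes $f(\theta) = \sum_i |x_i - \theta|$, whose minimizer is the sample median. Each term $|x_i - \theta|$ is majorized at $\theta^n$ by the quadratic $\frac{(x_i - \theta)^2}{2|x_i - \theta^n|} + \frac{|x_i - \theta^n|}{2}$, and the surrogate is minimized in closed form as a weighted mean. The toy data, the function name `mm_median`, and the small `eps` safeguard against division by zero are illustrative choices.

```python
import numpy as np

def mm_median(x, theta0, n_iter=50, eps=1e-8):
    """MM iteration for min_theta sum_i |x_i - theta| (the sample median).

    Each |x_i - theta| is majorized at theta_n by
    (x_i - theta)^2 / (2 |x_i - theta_n|) + |x_i - theta_n| / 2,
    so the surrogate is a weighted sum of squares whose minimizer
    is a weighted mean of the data.
    """
    theta = theta0
    for _ in range(n_iter):
        w = 1.0 / (np.abs(x - theta) + eps)  # weights from the current majorizer
        theta = np.sum(w * x) / np.sum(w)    # theta^{n+1}: argmin of the surrogate
    return theta

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
print(mm_median(x, theta0=x.mean()))  # tends to the sample median, 3.0
```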

4. Descent Property


  • An MM minimization algorithm satisfies the descent property $f(θ^{n+1}) ≤ f(θ^n)$, with strict inequality unless both
 $g(θ^{n+1} | θ^n) = g(θ^n | θ^n)$ and
 $f(θ^{n+1}) = g(θ^{n+1} | θ^n)$.
  • The descent property follows from the definitions and 
     $f(θ^{n+1}) = g(θ^{n+1} | θ^n) + f(θ^{n+1}) − g(θ^{n+1} | θ^n) $
     $ ≤ g(θ^n | θ^n) + f(θ^n) − g(θ^n | θ^n) = f(θ^n). $
(The inequality holds because $θ^{n+1}$ minimizes $g(θ | θ^n)$, so $g(θ^{n+1} | θ^n) ≤ g(θ^n | θ^n)$, while majorization gives $f(θ^{n+1}) − g(θ^{n+1} | θ^n) ≤ 0 = f(θ^n) − g(θ^n | θ^n)$.)
  • The descent property makes the MM algorithm very stable.
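
To make the descent property visible, the short sketch below (again my own illustrative example, reusing the quadratic-majorization update for $f(\theta) = \sum_i |x_i - \theta|$ from the previous section) tracks $f(\theta^n)$ across iterations and checks that the sequence never increases.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
f = lambda t: np.sum(np.abs(x - t))        # the objective

theta, eps = x.mean(), 1e-8
values = [f(theta)]
for _ in range(30):
    w = 1.0 / (np.abs(x - theta) + eps)    # majorizer weights at theta^n
    theta = np.sum(w * x) / np.sum(w)      # minimize the surrogate
    values.append(f(theta))

print(values[:4])                          # monotonically decreasing
print(np.all(np.diff(values) <= 1e-9))     # descent property holds: True
```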


5. Constructing majorizing (minorizing) functions

  • Jensen's inequality: for any convex function $f(x)$ and any random variable $X$, we have:
$f(E(X)) \leq E(f(X))$.
      Since $-\ln(x)$ is a convex function, we conclude for probability densities $a(x), b(x)$ that

          $-\ln E\left[\frac{a(X)}{b(X)}\right] \leq -E\left[\ln\frac{a(X)}{b(X)}\right]$.

       If $X$ has density function $b(x)$, then $E(\frac{a(X)}{b(X)}) = 1$, so the left-hand side above is $-\ln 1 = 0$; hence $E(\ln \frac{a(X)}{b(X)}) \leq 0$, i.e. $E(\ln a(X)) \leq E(\ln b(X))$ (all expectations taken with respect to $b$).

       Discussion: why is the EM algorithm a special case of the MM algorithm?
       Answer: in EM, the E-step uses exactly this Jensen argument to construct a surrogate $Q(\theta | \theta^n)$ that minorizes the observed-data log-likelihood (up to an additive constant independent of $\theta$), and the M-step maximizes that surrogate; this is precisely the minorize/maximize scheme described above.
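
A quick numerical sanity check of the inequality $E(\ln a(X)) \leq E(\ln b(X))$ derived above, using two arbitrary discrete densities on a common finite support (the random densities and the seed are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# two arbitrary discrete densities a, b on the same finite support
a = rng.random(6); a /= a.sum()
b = rng.random(6); b /= b.sum()

# expectations are taken under b, the density of X
lhs = np.sum(b * np.log(a))   # E[ln a(X)]
rhs = np.sum(b * np.log(b))   # E[ln b(X)]
print(lhs, rhs, lhs <= rhs)   # lhs <= rhs, since rhs - lhs = KL(b || a) >= 0
```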

  •  Minorization via Supporting Hyperplanes


     Any linear function tangent to the graph of a convex function is a minorizer at the point of tangency. Hence, if $f(x)$ is convex and differentiable, then:



$f(\theta) \geq f(\theta^{n}) + \nabla f(\theta^n)^t(\theta - \theta^n)$


     with equality when $\theta = \theta^n$.
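
For instance (an illustrative check of my own, not from the original notes), with the convex function $f(\theta) = e^{\theta}$ the tangent line at $\theta^n$ lies below the graph everywhere and touches it at the point of tangency:

```python
import numpy as np

f, df = np.exp, np.exp                     # a convex function and its derivative

theta_n = 0.5                              # point of tangency theta^n
theta = np.linspace(-3.0, 3.0, 601)
tangent = f(theta_n) + df(theta_n) * (theta - theta_n)  # supporting hyperplane

print(np.all(f(theta) >= tangent - 1e-12))  # the tangent minorizes f: True
i = np.argmin(np.abs(theta - theta_n))
print(np.isclose(f(theta[i]), tangent[i]))  # equality at theta = theta^n: True
```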

  • Majorization via the Definition of Convexity

      A function $f(x)$ is a convex function if and only if:
   $f(\sum\limits_{i}\alpha_it_i) \leq \sum\limits_{i} \alpha_if(t_i)$, 
    for any finite collection of points $t_i$ and corresponding multipliers $\alpha_i \geq 0$ with $\sum\limits_{i}\alpha_i = 1$.
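
A small numerical illustration of this characterization (with arbitrary points and weights of my own choosing), using the convex function $f(t) = t^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.square                      # a convex function

t = rng.normal(size=5)             # arbitrary points t_i
alpha = rng.random(5)
alpha /= alpha.sum()               # multipliers alpha_i >= 0 summing to 1

lhs = f(np.dot(alpha, t))          # f(sum_i alpha_i t_i)
rhs = np.dot(alpha, f(t))          # sum_i alpha_i f(t_i)
print(lhs, rhs, lhs <= rhs)        # convexity: lhs <= rhs
```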

  • The Arithmetic-Geometric Mean Inequality
      For nonnegative numbers $x_1, \dots, x_m$, $\left(\prod\limits_{i} x_i\right)^{1/m} \leq \frac{1}{m}\sum\limits_{i} x_i$, with equality when all the $x_i$ are equal. In MM algorithms it is typically used to majorize a product by a sum; for example, taking $m = 2$ with $x_1 = \theta^2$ and $x_2 = (\theta^n)^2$ gives $|\theta| \, |\theta^n| \leq \frac{\theta^2 + (\theta^n)^2}{2}$, hence the quadratic majorizer $|\theta| \leq \frac{\theta^2}{2|\theta^n|} + \frac{|\theta^n|}{2}$ at any $\theta^n \neq 0$.
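
A quick check (with illustrative values of my own) that this AM-GM bound does produce a quadratic majorizer of $|\theta|$ at a nonzero $\theta^n$, the same majorizer used in the median sketch of Section 3:

```python
import numpy as np

theta_n = 1.5                                  # current iterate, nonzero
theta = np.linspace(-5.0, 5.0, 1001)

# AM-GM with x1 = theta^2, x2 = theta_n^2 gives |theta| <= theta^2/(2|theta_n|) + |theta_n|/2
g = theta**2 / (2 * abs(theta_n)) + abs(theta_n) / 2

print(np.all(g >= np.abs(theta) - 1e-12))      # g majorizes |theta|: True
i = np.argmin(np.abs(theta - theta_n))
print(np.isclose(g[i], abs(theta[i])))         # equality at theta = theta^n: True
```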
      
  • The Cauchy-Schwarz Inequality
The function $f(\theta) = ||\theta||$ is convex because it satisfies the triangle inequality and the homogeneity condition $||\alpha \theta|| = |\alpha| \, ||\theta||$.
From the supporting-hyperplane inequality above (applied at a point $\theta^n \neq 0$, where $\nabla ||\theta^n|| = \theta^n / ||\theta^n||$), we have:

      $||\theta|| \geq ||\theta^{n}|| + \frac{(\theta - \theta^n)^t\theta^n}{||\theta^n||} = \frac{\theta^t\theta^n}{||\theta^n||},$

which is also an immediate consequence of the Cauchy-Schwarz inequality $\theta^t\theta^n \leq ||\theta|| \, ||\theta^n||$. This linear function of $\theta$ therefore minorizes $||\theta||$ at $\theta^n$.
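
A short numerical check (random vectors of my own choosing) that $\theta^t\theta^n / ||\theta^n||$ minorizes $||\theta||$, with equality at $\theta = \theta^n$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_n = rng.normal(size=3)                      # current iterate, nonzero

def minorizer(th):
    """Linear minorizer of ||theta|| at theta^n: theta^t theta^n / ||theta^n||."""
    return th @ theta_n / np.linalg.norm(theta_n)

thetas = rng.normal(size=(1000, 3))               # random test points
norms = np.linalg.norm(thetas, axis=1)
print(np.all(norms >= thetas @ theta_n / np.linalg.norm(theta_n) - 1e-12))  # True
print(np.isclose(np.linalg.norm(theta_n), minorizer(theta_n)))              # equality at theta^n
```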


