MM Algorithm [unfinished]


1. Overview of MM algorithm


    • The MM algorithm is not an algorithm, but a prescription for constructing optimization algorithms. 
    • The EM algorithm from statistics is a special case.
    • An MM algorithm operates by creating a surrogate function that minorizes or majorizes the objective function. When the surrogate function is optimized, the objective function is driven uphill or downhill as needed.
    • In minimization MM stands for majorize/minimize, and in maximization MM stands for minorize/maximize.

2. Rationale for the MM principle


    • It can generate an algorithm that avoids matrix inversion. 
    • It can separate the parameters of a problem. 
    • It can linearize an optimization problem. 
    • It can deal gracefully with equality and inequality constraints. 
    • It can turn a non-differentiable problem into a smooth problem.

3. Majorization and definition of the algorithm


    • A function $g(θ | θ^n)$ is said to majorize the function $f(θ)$ at $θ^n$ provided: 

 $f(θ^n) = g(θ^n | θ^n); $

$f(θ) ≤ g(θ | θ^n)$ for all θ. 
    • The majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function.
    • A function $g(θ | θ^n)$ is said to minorize the function $f(θ)$ at $θ^n$ provided $−g(θ | θ^n)$ majorizes $−f(θ)$. 
    • In minimization, we choose a majorizing function $g(θ | θ^n)$ and minimize it. This produces the next point $θ^{n+1}$ in the algorithm.
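
As a concrete illustration (my own sketch, not from the original notes), the Python snippet below minimizes $f(\theta) = \sum_i |x_i - \theta|$, whose minimizer is the sample median. Each term $|x_i - \theta|$ is majorized at $\theta^n$ by the quadratic $\frac{(x_i - \theta)^2}{2|x_i - \theta^n|} + \frac{|x_i - \theta^n|}{2}$, and the surrogate is minimized in closed form as a weighted mean. The toy data, the function name `mm_median`, and the small `eps` safeguard against division by zero are illustrative choices.

```python
import numpy as np

def mm_median(x, theta0, n_iter=50, eps=1e-8):
    """MM iteration for min_theta sum_i |x_i - theta| (the sample median).

    Each |x_i - theta| is majorized at theta_n by
    (x_i - theta)^2 / (2 |x_i - theta_n|) + |x_i - theta_n| / 2,
    so the surrogate is a weighted sum of squares whose minimizer
    is a weighted mean of the data.
    """
    theta = theta0
    for _ in range(n_iter):
        w = 1.0 / (np.abs(x - theta) + eps)  # weights from the current majorizer
        theta = np.sum(w * x) / np.sum(w)    # theta^{n+1}: argmin of the surrogate
    return theta

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
print(mm_median(x, theta0=x.mean()))  # tends to the sample median, 3.0
```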

4. Descent Property


  • An MM minimization algorithm satisfies the descent property $f(θ^{n+1}) ≤ f(θ^n)$, with strict inequality unless both
 $g(θ^{n+1} | θ^n) = g(θ^n | θ^n)$ and
 $f(θ^{n+1}) = g(θ^{n+1} | θ^n)$.
  • The descent property follows from the definitions and 
     $f(θ^{n+1}) = g(θ^{n+1} | θ^n) + f(θ^{n+1}) − g(θ^{n+1} | θ^n) $
     $ ≤ g(θ^n | θ^n) + f(θ^n) − g(θ^n | θ^n) = f(θ^n). $
(The inequality holds because $θ^{n+1}$ minimizes $g(θ | θ^n)$, so $g(θ^{n+1} | θ^n) ≤ g(θ^n | θ^n)$, while majorization gives $f(θ^{n+1}) − g(θ^{n+1} | θ^n) ≤ 0 = f(θ^n) − g(θ^n | θ^n)$.)
  • The descent property makes the MM algorithm very stable.
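
To make the descent property visible, the short sketch below (again my own illustrative example, reusing the quadratic-majorization update for $f(\theta) = \sum_i |x_i - \theta|$ from the previous section) tracks $f(\theta^n)$ across iterations and checks that the sequence never increases.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
f = lambda t: np.sum(np.abs(x - t))        # the objective

theta, eps = x.mean(), 1e-8
values = [f(theta)]
for _ in range(30):
    w = 1.0 / (np.abs(x - theta) + eps)    # majorizer weights at theta^n
    theta = np.sum(w * x) / np.sum(w)      # minimize the surrogate
    values.append(f(theta))

print(values[:4])                          # monotonically decreasing
print(np.all(np.diff(values) <= 1e-9))     # descent property holds: True
```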


5. Constructing majorizing (minorizing) functions

  • Jensen's inequality: for any convex function $f(x)$ and any random variable $X$, we have:
$f(E(X)) \leq E(f(X))$.
      Since $-\ln(x)$ is a convex function, we conclude for probability densities $a(x), b(x)$ that

          $-\ln E\left[\frac{a(X)}{b(X)}\right] \leq -E\left[\ln\frac{a(X)}{b(X)}\right]$.

       If $X$ has density function $b(x)$, then $E(\frac{a(X)}{b(X)}) = 1$, so the left-hand side above is $-\ln 1 = 0$; hence $E(\ln \frac{a(X)}{b(X)}) \leq 0$, i.e. $E(\ln a(X)) \leq E(\ln b(X))$ (all expectations taken with respect to $b$).

       Discussion: why is the EM algorithm a special case of the MM algorithm?
       Answer: in EM, the E-step uses exactly this Jensen argument to construct a surrogate $Q(\theta | \theta^n)$ that minorizes the observed-data log-likelihood (up to an additive constant independent of $\theta$), and the M-step maximizes that surrogate; this is precisely the minorize/maximize scheme described above.
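
A quick numerical sanity check of the inequality $E(\ln a(X)) \leq E(\ln b(X))$ derived above, using two arbitrary discrete densities on a common finite support (the random densities and the seed are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# two arbitrary discrete densities a, b on the same finite support
a = rng.random(6); a /= a.sum()
b = rng.random(6); b /= b.sum()

# expectations are taken under b, the density of X
lhs = np.sum(b * np.log(a))   # E[ln a(X)]
rhs = np.sum(b * np.log(b))   # E[ln b(X)]
print(lhs, rhs, lhs <= rhs)   # lhs <= rhs, since rhs - lhs = KL(b || a) >= 0
```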

  •  Minorization via Supporting Hyperplanes


     Any linear function tangent to the graph of a convex function is a minorizer at the point of tangency. Hence, if $f(x)$ is convex and differentiable, then:



$f(\theta) \geq f(\theta^{n}) + \nabla f(\theta^n)^t(\theta - \theta^n)$


     with equality when $\theta = \theta^n$.
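
For instance (an illustrative check of my own, not from the original notes), with the convex function $f(\theta) = e^{\theta}$ the tangent line at $\theta^n$ lies below the graph everywhere and touches it at the point of tangency:

```python
import numpy as np

f, df = np.exp, np.exp                     # a convex function and its derivative

theta_n = 0.5                              # point of tangency theta^n
theta = np.linspace(-3.0, 3.0, 601)
tangent = f(theta_n) + df(theta_n) * (theta - theta_n)  # supporting hyperplane

print(np.all(f(theta) >= tangent - 1e-12))  # the tangent minorizes f: True
i = np.argmin(np.abs(theta - theta_n))
print(np.isclose(f(theta[i]), tangent[i]))  # equality at theta = theta^n: True
```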

  • Majorization via the Definition of Convexity

      A function $f(x)$ is a convex function if and only if:
   $f(\sum\limits_{i}\alpha_it_i) \leq \sum\limits_{i} \alpha_if(t_i)$, 
    for any finite collection of points $t_i$ and corresponding multipliers $\alpha_i \geq 0$ with $\sum\limits_{i}\alpha_i = 1$.
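
A small numerical illustration of this characterization (with arbitrary points and weights of my own choosing), using the convex function $f(t) = t^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.square                      # a convex function

t = rng.normal(size=5)             # arbitrary points t_i
alpha = rng.random(5)
alpha /= alpha.sum()               # multipliers alpha_i >= 0 summing to 1

lhs = f(np.dot(alpha, t))          # f(sum_i alpha_i t_i)
rhs = np.dot(alpha, f(t))          # sum_i alpha_i f(t_i)
print(lhs, rhs, lhs <= rhs)        # convexity: lhs <= rhs
```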

  • The Arithmetic-Geometric Mean Inequality
      For nonnegative numbers $x_1, \dots, x_m$, $\left(\prod\limits_{i} x_i\right)^{1/m} \leq \frac{1}{m}\sum\limits_{i} x_i$, with equality when all the $x_i$ are equal. In MM algorithms it is typically used to majorize a product by a sum; for example, taking $m = 2$ with $x_1 = \theta^2$ and $x_2 = (\theta^n)^2$ gives $|\theta| \, |\theta^n| \leq \frac{\theta^2 + (\theta^n)^2}{2}$, hence the quadratic majorizer $|\theta| \leq \frac{\theta^2}{2|\theta^n|} + \frac{|\theta^n|}{2}$ at any $\theta^n \neq 0$.
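
A quick check (with illustrative values of my own) that this AM-GM bound does produce a quadratic majorizer of $|\theta|$ at a nonzero $\theta^n$, the same majorizer used in the median sketch of Section 3:

```python
import numpy as np

theta_n = 1.5                                  # current iterate, nonzero
theta = np.linspace(-5.0, 5.0, 1001)

# AM-GM with x1 = theta^2, x2 = theta_n^2 gives |theta| <= theta^2/(2|theta_n|) + |theta_n|/2
g = theta**2 / (2 * abs(theta_n)) + abs(theta_n) / 2

print(np.all(g >= np.abs(theta) - 1e-12))      # g majorizes |theta|: True
i = np.argmin(np.abs(theta - theta_n))
print(np.isclose(g[i], abs(theta[i])))         # equality at theta = theta^n: True
```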
      
  • The Cauchy-Schwarz Inequality
The function $f(\theta) = ||\theta||$ is convex because it satisfies the triangle inequality and the homogeneity condition $||\alpha \theta|| = |\alpha| \, ||\theta||$.
From the supporting-hyperplane inequality above (applied at a point $\theta^n \neq 0$, where $\nabla ||\theta^n|| = \theta^n / ||\theta^n||$), we have:

      $||\theta|| \geq ||\theta^{n}|| + \frac{(\theta - \theta^n)^t\theta^n}{||\theta^n||} = \frac{\theta^t\theta^n}{||\theta^n||},$

which is also an immediate consequence of the Cauchy-Schwarz inequality $\theta^t\theta^n \leq ||\theta|| \, ||\theta^n||$. This linear function of $\theta$ therefore minorizes $||\theta||$ at $\theta^n$.
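
A short numerical check (random vectors of my own choosing) that $\theta^t\theta^n / ||\theta^n||$ minorizes $||\theta||$, with equality at $\theta = \theta^n$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_n = rng.normal(size=3)                      # current iterate, nonzero

def minorizer(th):
    """Linear minorizer of ||theta|| at theta^n: theta^t theta^n / ||theta^n||."""
    return th @ theta_n / np.linalg.norm(theta_n)

thetas = rng.normal(size=(1000, 3))               # random test points
norms = np.linalg.norm(thetas, axis=1)
print(np.all(norms >= thetas @ theta_n / np.linalg.norm(theta_n) - 1e-12))  # True
print(np.isclose(np.linalg.norm(theta_n), minorizer(theta_n)))              # equality at theta^n
```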


