MM Algorithm
1. Overview of MM algorithm
- The MM algorithm is not an algorithm, but a prescription for constructing optimization algorithms.
- The EM algorithm from statistics is a special case.
- An MM algorithm operates by creating a surrogate function that minorizes or majorizes the objective function. When the surrogate function is optimized, the objective function is driven uphill or downhill as needed.
- In minimization MM stands for majorize/minimize, and in maximization MM stands for minorize/maximize.
2. Rationale for the MM principle
- It can generate an algorithm that avoids matrix inversion.
- It can separate the parameters of a problem.
- It can linearize an optimization problem.
- It can deal gracefully with equality and inequality constraints.
- It can turn a non-differentiable problem into a smooth problem.
3. Majorization and definition of algorithms
- A function $g(θ | θ^n)$ is said to majorize the function $f(θ)$ at $θ^n$ provided:
$f(θ^n) = g(θ^n | θ^n)$;
$f(θ) ≤ g(θ | θ^n)$ for all $θ$.
- The majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function.
- A function $g(θ | θ^n)$ is said to minorize the function $f(θ)$ at $θ^n$ provided $−g(θ | θ^n)$ majorizes $−f(θ)$.
- In minimization, we choose a majorizing function $g(θ | θ^n)$ and minimize it. This produces the next point $θ^{n+1}$ in the algorithm.
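To make the prescription concrete, here is a minimal sketch of the generic MM loop in Python. The names `mm_minimize` and `surrogate_argmin` are placeholders of my own; `surrogate_argmin(theta_n)` stands for whatever problem-specific routine returns a minimizer of the chosen majorizer $g(θ | θ^n)$.

```python
import numpy as np

def mm_minimize(f, surrogate_argmin, theta0, max_iter=100, tol=1e-8):
    """Generic MM loop: repeatedly minimize a majorizing surrogate.

    f                : objective function to minimize
    surrogate_argmin : maps the current iterate theta_n to
                       argmin_theta g(theta | theta_n)
    theta0           : starting point
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_new = surrogate_argmin(theta)      # minimize g(. | theta_n)
        # Majorization guarantees f(theta_new) <= f(theta) (the descent property).
        if abs(f(theta) - f(theta_new)) < tol:
            return theta_new
        theta = theta_new
    return theta
```

Any concrete MM algorithm is obtained by supplying a surrogate whose minimizer is cheap to compute, ideally in closed form.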
4. Descent Property
- An MM minimization algorithm satisfies the descent property $f(θ^{n+1}) ≤ f(θ^n)$, with strict inequality unless both
$g(θ^{n+1} | θ^n) = g(θ^n | θ^n)$ and
$f(θ^{n+1}) = g(θ^{n+1} | θ^n)$.
- The descent property follows from the definitions and
$f(θ^{n+1}) = g(θ^{n+1} | θ^n) + f(θ^{n+1}) − g(θ^{n+1} | θ^n)$
$\leq g(θ^n | θ^n) + f(θ^n) − g(θ^n | θ^n) = f(θ^n).$
(The inequality holds because $θ^{n+1}$ minimizes $g(θ | θ^n)$, so $g(θ^{n+1} | θ^n) \leq g(θ^n | θ^n)$, and because $f(θ^{n+1}) − g(θ^{n+1} | θ^n) \leq 0 = f(θ^n) − g(θ^n | θ^n)$ by the majorization conditions.)
- The descent property makes the MM algorithm very stable.
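As a quick numerical check of the descent property, the sketch below minimizes $f(θ) = \cos θ$ with the quadratic surrogate $g(θ | θ^n) = \cos θ^n − \sin(θ^n)(θ − θ^n) + \frac{1}{2}(θ − θ^n)^2$, which majorizes $\cos θ$ because the second derivative of $\cos θ$ is bounded by $1$ in absolute value. This particular surrogate is my own choice for illustration; its minimizer is $θ^{n+1} = θ^n + \sin θ^n$, so each MM step is a one-line update.

```python
import numpy as np

# Objective: f(theta) = cos(theta).  The surrogate
#   g(theta | t) = cos(t) - sin(t) * (theta - t) + 0.5 * (theta - t) ** 2
# majorizes cos(theta) (|cos''| <= 1) and is minimized at theta = t + sin(t).
f = np.cos

theta = 1.0                        # arbitrary starting point
values = [f(theta)]
for _ in range(20):
    theta = theta + np.sin(theta)  # closed-form minimizer of the surrogate
    values.append(f(theta))

# The descent property predicts a non-increasing sequence of objective values.
assert all(b <= a + 1e-12 for a, b in zip(values, values[1:]))
print(theta)                       # approaches pi, a minimizer of cos
```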
5. Constructing majorization (minorization) function
- Jensen's inequality: for any convex function $f(x)$ and any random variable $X$, we have:
$f(E(X)) \leq E(f(X))$.
Since $-\ln(x)$ is a convex function, we conclude for probability densities $a(x)$ and $b(x)$ that
$-\ln {E(\frac{a(X)}{b(X)})} \leq - E[\ln(\frac{a(X)}{b(X)})]$.
If $X$ has density function $b(x)$, then $E(\frac{a(X)}{b(X)}) = 1$. The above inequality implies: $E( \ln (a(X))) \leq E( \ln (b(X)))$.
- Minorization via Supporting Hyperplanes
Any linear function tangent to the graph of a convex function is a minorizer at the point of tangency. Hence, if $f(\theta)$ is convex and differentiable, then:
$f(\theta) \geq f(\theta^{n}) + \nabla f(\theta^n)^t(\theta - \theta^n)$
with equality when $\theta = \theta^n$.
- Majorization via the Definition of Convexity
A function $f$ is convex if and only if:
$f(\sum\limits_{i}\alpha_it_i) \leq \sum\limits_{i} \alpha_if(t_i)$
for any finite collection of points $t_i$ and corresponding multipliers $\alpha_i \geq 0$ with $\sum\limits_{i}\alpha_i = 1$.
- The Arithmetic-Geometric Mean Inequality
For nonnegative numbers $a$ and $b$, $\sqrt{ab} \leq \frac{a + b}{2}$. Taking $a = \theta^2$ and $b = (\theta^n)^2$ yields the quadratic majorizer $|\theta| \leq \frac{\theta^2}{2|\theta^n|} + \frac{|\theta^n|}{2}$ of $|\theta|$ at $\theta^n \neq 0$; a worked example using this bound appears at the end of this section.
- The Cauchy-Schwarz Inequality
The function $f(\theta) = ||\theta||$ is convex because it satisfies the triangle inequality and the homogeneity condition $||\alpha \theta|| = |\alpha| \, ||\theta||$.
From the supporting-hyperplane inequality above, with $\nabla ||\theta|| = \theta / ||\theta||$ at $\theta^n \neq 0$, we have:
$||\theta|| \geq ||\theta^{n}|| + \frac{(\theta - \theta^n)^t\theta^n}{||\theta^n||} = \frac{\theta^t\theta^n}{||\theta^n||}.$
Equivalently, the Cauchy-Schwarz inequality $\theta^t\theta^n \leq ||\theta|| \, ||\theta^n||$ gives this minorizer of $||\theta||$ directly.
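Several of these ingredients come together in a small worked example. The sketch below is a minimal Python implementation, under my own illustration choices (the function name `lad_location_mm` and the `eps` guard against zero residuals), of an MM algorithm for the least-absolute-deviations location problem $\min_\theta \sum_i |y_i - \theta|$: the AM-GM bound $|r| \leq \frac{r^2}{2|r^n|} + \frac{|r^n|}{2}$ majorizes each residual term, and the resulting quadratic surrogate is minimized by a weighted mean.

```python
import numpy as np

def lad_location_mm(y, theta0=None, max_iter=200, eps=1e-10):
    """MM for min_theta sum_i |y_i - theta|, using the AM-GM majorization
    |r| <= r**2 / (2*|r_n|) + |r_n| / 2 applied to each residual."""
    y = np.asarray(y, dtype=float)
    theta = float(np.mean(y)) if theta0 is None else float(theta0)
    for _ in range(max_iter):
        w = 1.0 / np.maximum(np.abs(y - theta), eps)  # surrogate weights 1 / |r_n|
        theta_new = float(np.sum(w * y) / np.sum(w))  # weighted mean minimizes the surrogate
        if abs(theta_new - theta) < 1e-12:
            return theta_new
        theta = theta_new
    return theta

y = np.array([1.0, 2.0, 2.5, 7.0, 100.0])
print(lad_location_mm(y))   # approaches the sample median
print(np.median(y))         # 2.5
```

Each iteration solves a weighted least-squares surrogate in closed form, and the descent property guarantees the absolute-deviation objective never increases along the way.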