Batch and Online Mixture Learning: A Review with Extensions

Computational Information Geometry

Part of the book series: Signals and Communication Technology (SCT)

Abstract

This paper addresses the problem of online learning of finite statistical mixtures of regular exponential families. We begin with a concise review of gradient-based and stochastic gradient-based optimization methods and their generalizations. We then focus on two stochastic versions of the celebrated Expectation-Maximization (EM) algorithm: Titterington’s second-order stochastic gradient EM and Cappé and Moulines’ online EM. Depending on which step of EM is approximated, the constraints on the mixture parameters may be violated. A justification of these approaches is given, along with ready-to-use formulas for mixtures of regular exponential families. Finally, to illustrate our study, we report experimental comparisons on univariate normal mixtures.

Notes

  1. It is equivalent to an exponentially decaying moving average of past gradients.

  2. When \((\nabla F)^{-1}\) is computed with numerical approximations, this may give a different result.

References

  • Amari, S. (1997). Neural learning in structured parameter spaces — Natural Riemannian gradient. Advances in Neural Information Processing Systems (NIPS), 9, 127–133.

  • Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.

  • Amari, S. (2016). Information geometry and its applications. Applied Mathematical Sciences. Japan: Springer.

  • Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705–1749.

  • Bogdan, K., & Bogdan, M. (2000). On existence of maximum likelihood estimators in exponential families. Statistics, 34(2), 137–149.

  • Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning and neural networks. Cambridge: Cambridge University Press.

  • Bottou, L., & Bousquet, O. (2011). The tradeoffs of large scale learning. In S. Sra, S. Nowozin, & S. J. Wright (Eds.), Optimization for machine learning (pp. 351–368). Cambridge: MIT Press.

  • Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

  • Cappé, O., & Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society. Series B (Methodological), 71(3), 593–613.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39, 1–38.

  • Liu, Q., & Ihler, A. T. (2014). Distributed estimation, information loss and exponential families. Advances in Neural Information Processing Systems, 27, 1098–1106.

  • Miura, K. (2011). An introduction to maximum likelihood estimation in information geometry. Interdisciplinary Information Sciences, 17(3), 155–174.

  • Neal, R. M., & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Cambridge: MIT Press.

  • Nielsen, F., & Garcia, V. (2009). Statistical exponential families: A digest with flash cards. arXiv:0911.4863.

  • Petersen, K. B., & Pedersen, M. S. (2012). The matrix cookbook. http://www2.imm.dtu.dk/pubdb/p.php?3274.

  • Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838–855.

  • Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.

  • Saint-Jean, C., & Nielsen, F. (2014). Hartigan’s method for \(k\)-MLE: Mixture modeling with Wishart distributions and its application to motion retrieval. Geometric theory of information (pp. 301–330). New York: Springer.

  • Sculley, D. (2010). Web-scale \(k\)-means clustering. In Proceedings of the 19th International Conference on World Wide Web (pp. 1177–1178).

  • Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.

  • Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society. Series B (Methodological), 46(2), 257–267.


Author information

Correspondence to Christophe Saint-Jean.

Appendices

Univariate Gaussian Distribution as an Exponential Family

Canonical Decomposition and \({\varvec{F}}\)

$$\begin{aligned} f(x;\mu ,\sigma ^{2})&= \frac{1}{(2 \pi \sigma ^{2})^{1/2}}\exp \left\{ -\frac{(x - \mu )^{2}}{2\sigma ^{2}} \right\} \\&= \exp \left\{ -\frac{1}{2\sigma ^{2}} (x^{2} - 2 x \mu + \mu ^{2}) - \frac{1}{2} \log \left( 2 \pi \sigma ^{2}\right) \right\} \\&= \exp \left\{ \langle \frac{1}{2\sigma ^{2}}, -x^{2} \rangle + \langle \frac{\mu }{\sigma ^{2}}, x \rangle - \frac{\mu ^{2}}{2\sigma ^{2}} - \frac{1}{2} \log \left( 2 \pi \sigma ^{2}\right) \right\} \\ \end{aligned}$$

In the sequel, the vector of source parameters is denoted \(\lambda =(\mu , \sigma ^2)\). One may recognize the canonical form of an exponential family

$$f(x;\theta ) = \exp \left\{ <\theta ,s(x)> +\, k(x) - F(\theta )\right\} $$

by setting \(\theta = (\theta _1,\theta _2)\) with

$$\begin{aligned} \theta _{1}&= \frac{\mu }{\sigma ^{2}} \Longleftrightarrow \mu = \frac{\theta _{1}}{2\theta _{2}}\end{aligned}$$
(52)
$$\begin{aligned} \theta _{2}&= \frac{1}{2\sigma ^{2}} \Longleftrightarrow \sigma ^{2} = \frac{1}{2\theta _{2}} \end{aligned}$$
(53)
$$\begin{aligned} s(x)&=(x,-x^{2}) \end{aligned}$$
(54)
$$\begin{aligned} k(x)&= 0 \end{aligned}$$
(55)
$$\begin{aligned} f(x; \theta _{1}, \theta _{2})&= \exp \left\{ \langle \theta _{2}, -x^{2} \rangle + \langle \theta _{1}, x \rangle - \frac{1}{2} \frac{(\theta _{1}/2\theta _{2})^{2}}{1/2\theta _{2}} - \frac{1}{2} \log (2\pi /2\theta _{2})\right\} \\&= \exp \left\{ \langle \theta _{2}, -x^{2} \rangle + \langle \theta _{1}, x \rangle - \frac{\theta _{1}^{2}}{4\theta _{2}} - \frac{1}{2} \log (\pi ) + \frac{1}{2} \log \theta _{2}\right\} \end{aligned}$$

with the log normalizer F as

$$\begin{aligned} F(\theta _{1}, \theta _{2}) = \frac{\theta _{1}^{2}}{4\theta _{2}} + \frac{1}{2} \log (\pi ) - \frac{1}{2} \log \theta _{2} \end{aligned}$$
(56)
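
As a quick numerical sanity check (our addition, not part of the original text), the following Python sketch evaluates the canonical form \(\exp \left\{ \langle \theta , s(x) \rangle - F(\theta )\right\} \) with the parameters of Eqs. (52)–(56) and compares it to the usual normal density; the helper functions to_natural, log_normalizer and pdf_canonical are ours.

```python
import numpy as np
from scipy.stats import norm

def to_natural(mu, sigma2):
    # (mu, sigma^2) -> (theta_1, theta_2), Eqs. (52)-(53)
    return mu / sigma2, 1.0 / (2.0 * sigma2)

def log_normalizer(theta1, theta2):
    # F(theta_1, theta_2), Eq. (56)
    return theta1**2 / (4.0 * theta2) + 0.5 * np.log(np.pi) - 0.5 * np.log(theta2)

def pdf_canonical(x, mu, sigma2):
    # exp{ <theta_1, x> + <theta_2, -x^2> - F(theta) } with s(x) = (x, -x^2), k(x) = 0
    t1, t2 = to_natural(mu, sigma2)
    return np.exp(t1 * x + t2 * (-x**2) - log_normalizer(t1, t2))

x = np.linspace(-3.0, 5.0, 9)
mu, sigma2 = 1.2, 0.7
print(np.allclose(pdf_canonical(x, mu, sigma2),
                  norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))))  # expected: True
```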

1.1 Gradient of the Log-Normalizer

The gradient of the log-normalizer is given by:

$$\begin{aligned} \frac{\partial F}{\partial \theta _{1}}(\theta _{1}, \theta _{2})&= \frac{\theta _{1}}{2\theta _{2}} \end{aligned}$$
(57)
$$\begin{aligned} \frac{\partial F}{\partial \theta _{2}}(\theta _{1},\theta _{2})&= -\frac{\theta _{1}^{2}}{4\theta _{2}^{2}} - \frac{1}{2\theta _{2}} \end{aligned}$$
(58)

In order to get the dual coordinate system \(\eta =(\eta _{1}, \eta _{2})\), the following set of equations has to be inverted:

$$\begin{aligned} \eta _{1}&= \frac{\theta _{1}}{2\theta _{2}} \end{aligned}$$
(59)
$$\begin{aligned} \eta _{2}&= -\frac{\theta _{1}^{2}}{4\theta _{2}^{2}} - \frac{1}{2\theta _{2}} \end{aligned}$$
(60)

By plugging the first equation into the second one, it follows:

$$\begin{aligned} \eta _{2} = - \eta _{1}^{2} - \frac{1}{2\theta _{2}} \Longleftrightarrow&\theta _{2} = -\frac{1}{2(\eta _{1}^{2} + \eta _{2})}&= \frac{\partial F^*}{\partial \eta _{2}}(\eta _{1},\eta _{2}) \end{aligned}$$
(61)
$$\begin{aligned}&\theta _{1} = 2 \theta _{2} \eta _{1} = - \frac{\eta _{1}}{(\eta _{1}^{2} + \eta _{2})}&= \frac{\partial F^*}{\partial \eta _{1}}(\eta _{1},\eta _{2}) \end{aligned}$$
(62)

Formulas are even simpler regarding the source parameters since we know that

$$\begin{aligned} \eta _{1} = \mathbb {E}[X] = \mu\Longleftrightarrow & {} \mu = \eta _{1} \end{aligned}$$
(63)
$$\begin{aligned} \eta _{2} = \mathbb {E}[-X^2] = -\left\{ \mu ^2 + \sigma ^2\right\}\Longleftrightarrow & {} \sigma ^2 = - \left\{ \eta _{1}^2 + \eta _{2}\right\} \end{aligned}$$
(64)
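
These conversions are easy to check numerically. The short sketch below (ours, using only NumPy) implements \(\nabla F\) from Eqs. (57)–(58) and \(\nabla F^{*}\) from Eqs. (61)–(62), and verifies that they are mutually inverse and agree with Eqs. (63)–(64).

```python
import numpy as np

def grad_F(theta1, theta2):
    # (eta_1, eta_2) = grad F(theta), Eqs. (57)-(58)
    return theta1 / (2.0 * theta2), -theta1**2 / (4.0 * theta2**2) - 1.0 / (2.0 * theta2)

def grad_F_star(eta1, eta2):
    # (theta_1, theta_2) = grad F*(eta), Eqs. (61)-(62)
    return -eta1 / (eta1**2 + eta2), -1.0 / (2.0 * (eta1**2 + eta2))

mu, sigma2 = -0.4, 2.5
theta = (mu / sigma2, 1.0 / (2.0 * sigma2))       # Eqs. (52)-(53)
eta = grad_F(*theta)

print(np.allclose(eta, (mu, -(mu**2 + sigma2))))  # Eqs. (63)-(64): True
print(np.allclose(grad_F_star(*eta), theta))      # the two gradient maps are inverse: True
```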

In order to compute \(F^{*}\), we simply have to reuse our previous results in

$$F^{*}(H) = \langle (\nabla F)^{-1} (H), H \rangle - F ( (\nabla F)^{-1} (H))$$

and obtain the following expression:

$$\begin{aligned} F^{*}(\eta _{1}, \eta _{2}) = - \frac{1}{2} \log (e\pi ) - \frac{1}{2} \log \left( -2(\eta _{1}^{2} + \eta _{2})\right) = - \frac{1}{2} \log (e\pi ) - \frac{1}{2} \log \left( 2\sigma ^{2}\right) \end{aligned}$$

The Hessians \(H(F)\) and \(H(F^*)\) of F and \(F^*\), respectively, are

$$\begin{aligned} H(F)(\theta _1, \theta _2) = \begin{pmatrix} \frac{1}{2 \theta _2} &{} -\frac{\theta _1}{2 \theta _2^2}\\ -\frac{\theta _1}{2 \theta _2^2} &{} \frac{\theta _1^2 + \theta _2}{2 \theta _2^3} \end{pmatrix} \end{aligned}$$
(65)
$$\begin{aligned} H(F^*)(\eta _1, \eta _2) = \begin{pmatrix} \frac{\eta _1^2 - \eta _2}{(\eta _1^2 + \eta _2)^2} &{} \frac{\eta _1}{(\eta _1^2 + \eta _2)^2}\\ \frac{\eta _1}{(\eta _1^2 + \eta _2)^2} &{} \frac{1}{2(\eta _1^2 + \eta _2)^2} \end{pmatrix} \end{aligned}$$
(66)

Since the univariate normal distribution is an exponential family, the Kullback–Leibler divergence is a Bregman divergence for \(F^*\) on expectation parameters:

$$\begin{aligned} KL(\mathcal {N}(\mu _{p},\sigma ^2_{p}) || \mathcal {N}(\mu _{q},\sigma ^2_{q}))&= B_{F^*}(\eta _p : \eta _q) \\&= F^*(\eta _p) - F^*(\eta _q) - \langle \eta _p - \eta _q, \nabla F^* (\eta _q) \rangle \end{aligned}$$

After calculations, it follows:

$$\begin{aligned} B_{F^*}(\eta _p : \eta _q) = \frac{1}{2} \left( \log \left( \frac{\eta _{1_q}^{2} + \eta _{2_q}}{\eta _{1_p}^{2} + \eta _{2_p}}\right) + \frac{2(\eta _{1_p} - \eta _{1_q})\eta _{1_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} + \frac{\eta _{2_p} - \eta _{2_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} \right) \end{aligned}$$
(67)

A simple rewrite of it with the source parameters leads to the known closed form:

$$\begin{aligned} \frac{1}{2} \left( \log \left( \frac{\eta _{1_q}^{2} + \eta _{2_q}}{\eta _{1_p}^{2} + \eta _{2_p}}\right) + \frac{2(\eta _{1_p} - \eta _{1_q})\eta _{1_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} + \frac{\eta _{2_p} - \eta _{2_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} \right)&= \nonumber \\ \frac{1}{2} \left( \log \left( \frac{\eta _{1_q}^{2} + \eta _{2_q}}{\eta _{1_p}^{2} + \eta _{2_p}}\right) + \frac{(\eta _{1_p}^2 + \eta _{2_p}) - (\eta _{1_p}-\eta _{1_q})^2 - (\eta _{1_q}^2 + \eta _{2_q})}{(\eta _{1_q}^{2} + \eta _{2_q})} \right)&= \nonumber \\ \frac{1}{2} \left( \log \left( \frac{\sigma _q^{2}}{\sigma _p^{2}}\right) + \frac{\sigma _p^{2}}{\sigma _q^{2}} + \frac{(\mu _p-\mu _q)^2}{\sigma _q^{2}} -1 \right) \end{aligned}$$
(68)
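
As an illustration (our addition), the Bregman divergence \(B_{F^*}\) can be compared numerically against the closed form of Eq. (68); the function F_star below uses the expression of \(F^*\) given above, i.e. the univariate case of Eq. (87).

```python
import numpy as np

def F_star(eta1, eta2):
    # F*(eta) = -(1/2) log(e pi) - (1/2) log(-2 (eta_1^2 + eta_2)), with sigma^2 = -(eta_1^2 + eta_2)
    return -0.5 * np.log(np.e * np.pi) - 0.5 * np.log(-2.0 * (eta1**2 + eta2))

def grad_F_star(eta1, eta2):
    # Eqs. (61)-(62)
    return np.array([-eta1 / (eta1**2 + eta2), -1.0 / (2.0 * (eta1**2 + eta2))])

def bregman_F_star(eta_p, eta_q):
    # B_{F*}(eta_p : eta_q) = F*(eta_p) - F*(eta_q) - <eta_p - eta_q, grad F*(eta_q)>
    return (F_star(*eta_p) - F_star(*eta_q)
            - np.dot(np.asarray(eta_p) - np.asarray(eta_q), grad_F_star(*eta_q)))

def kl_closed_form(mu_p, s2_p, mu_q, s2_q):
    # Right-hand side of Eq. (68)
    return 0.5 * (np.log(s2_q / s2_p) + s2_p / s2_q + (mu_p - mu_q)**2 / s2_q - 1.0)

mu_p, s2_p, mu_q, s2_q = 0.3, 1.5, -1.0, 0.8
eta_p = (mu_p, -(mu_p**2 + s2_p))      # Eqs. (63)-(64)
eta_q = (mu_q, -(mu_q**2 + s2_q))
print(np.allclose(bregman_F_star(eta_p, eta_q),
                  kl_closed_form(mu_p, s2_p, mu_q, s2_q)))  # expected: True
```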

The Fisher information matrix \(I(\lambda )\) is obtained as the expectation of the outer product of the Fisher score with itself:

$$\begin{aligned} I(\lambda )&\mathop {=}\limits ^{def} \mathbb {E}\left[ \nabla _\lambda \log f(x;\lambda ) . \nabla _\lambda \log f(x;\lambda )^T\right] \nonumber \\&= \mathbb {E}\left[ \begin{pmatrix} \frac{x-\mu }{\sigma ^2}\\ \frac{(x-\mu )^2 - \sigma ^2}{2\sigma ^4}\end{pmatrix}. \begin{pmatrix} \frac{x-\mu }{\sigma ^2} &{} \frac{(x-\mu )^2 - \sigma ^2}{2\sigma ^4}\end{pmatrix}\right] \nonumber \\&=\begin{pmatrix} \frac{1}{\sigma ^2} &{} 0 \\ 0 &{} \frac{1}{2\sigma ^4}\end{pmatrix}. \end{aligned}$$
(69)

By change in coordinates or direct computation, the Fisher information matrix is also:

$$\begin{aligned} I(\theta ) = H(F)(\theta ) = \begin{pmatrix}\frac{1}{2\theta _2} &{} -\frac{\theta _1}{2\theta _2^2}\\ -\frac{\theta _1}{2\theta _2^2} &{} \frac{\theta _1^2 + \theta _2}{2\theta _2^3}\end{pmatrix} \text{ and } I(\eta ) = \frac{1}{(\eta _1^2 + \eta _2)^2} \begin{pmatrix} (\eta _1^2 - \eta _2) &{} \eta _1\\ \eta _1 &{} \frac{1}{2}\end{pmatrix} \end{aligned}$$
(70)
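
The "change in coordinates" mentioned above is the usual transformation rule for the Fisher information: with the Jacobian \(J = \partial \theta / \partial \lambda \), one has \(I(\lambda ) = {}^tJ \, I(\theta ) \, J\). The sketch below (our addition) checks this relation between Eqs. (65) and (69).

```python
import numpy as np

def fisher_lambda(mu, sigma2):
    # Eq. (69): Fisher information in source coordinates lambda = (mu, sigma^2)
    return np.diag([1.0 / sigma2, 1.0 / (2.0 * sigma2**2)])

def hessian_F(theta1, theta2):
    # Eq. (65): I(theta) = H(F)(theta)
    return np.array([[1.0 / (2.0 * theta2), -theta1 / (2.0 * theta2**2)],
                     [-theta1 / (2.0 * theta2**2), (theta1**2 + theta2) / (2.0 * theta2**3)]])

mu, sigma2 = 0.7, 1.3
theta1, theta2 = mu / sigma2, 1.0 / (2.0 * sigma2)

# Jacobian of theta(lambda): rows are (theta_1, theta_2), columns are (mu, sigma^2)
J = np.array([[1.0 / sigma2, -mu / sigma2**2],
              [0.0,          -1.0 / (2.0 * sigma2**2)]])

print(np.allclose(J.T @ hessian_F(theta1, theta2) @ J,
                  fisher_lambda(mu, sigma2)))  # expected: True
```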

1.2 Multivariate Gaussian Distribution as an Exponential Family

Canonical Decomposition and \({\varvec{F}}\)

$$\begin{aligned} f(x;\mu ,\varSigma )&= \frac{1}{(2 \pi )^{d / 2} |\varSigma |^{1/2}}\exp \left\{ -\frac{ {}^t (x - \mu ) \varSigma ^{-1} (x - \mu )}{2} \right\} \\&= \exp \left\{ -\frac{ {}^tx\varSigma ^{-1}x - {}^t\mu \varSigma ^{-1}x - {}^tx\varSigma ^{-1}\mu + {}^t\mu \varSigma ^{-1}\mu }{2} - \log \left( (2 \pi )^{d / 2} |\varSigma |^{1/2}\right) \right\} \\&= \exp \left\{ -\frac{tr({}^tx\varSigma ^{-1}x) - \langle {}^t\varSigma ^{-1} \mu , x \rangle -\langle x, \varSigma ^{-1}\mu \rangle + \langle {}^t\varSigma ^{-1}\mu , \varSigma \varSigma ^{-1} \mu \rangle }{2} - \log \left( \pi ^{d / 2} |2\varSigma |^{1/2}\right) \right\} \end{aligned}$$

Due to the cyclic property of the trace and to the symmetry of \(\varSigma ^{-1}\), it follows:

$$\begin{aligned} f(x;\mu ,\varSigma )&= \exp \left\{ tr\left( ^t\left( \frac{1}{2}\varSigma ^{-1}\right) (-x{}^tx)\right) + \langle \varSigma ^{-1} \mu , x \rangle - \frac{1}{2} \langle \varSigma ^{-1}\mu , \varSigma \varSigma ^{-1} \mu \rangle - \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2\varSigma |\right\} \\&= \exp \left\{ \langle \frac{1}{2}\varSigma ^{-1}, -x{}^tx \rangle _{F} + \langle \varSigma ^{-1} \mu , x \rangle - \frac{1}{4} {}^t(\varSigma ^{-1}\mu ) 2\varSigma (\varSigma ^{-1}\mu ) - \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2\varSigma | \right\} \\ \end{aligned}$$

where \(\langle \cdot , \cdot \rangle _{F}\) is the Frobenius scalar product. One may recognize the canonical form of an exponential family

$$f(x;\varTheta ) = \exp \left\{ <\varTheta ,t(x)> + k(x) - F(\varTheta )\right\} $$

by setting:

$$\varTheta = (\theta _{1}, \theta _2)$$
$$\begin{aligned} \theta _1&= \varSigma ^{-1}\mu \Longleftrightarrow \mu = \frac{1}{2}\theta _2^{-1} \theta _1\end{aligned}$$
(71)
$$\begin{aligned} \theta _2&= \frac{1}{2}\varSigma ^{-1} \Longleftrightarrow \varSigma = \frac{1}{2}\theta _2^{-1} \end{aligned}$$
(72)
$$\begin{aligned} t(x)&=(x,-x{}^tx)\end{aligned}$$
(73)
$$\begin{aligned} k(x)&= 0 \end{aligned}$$
(74)
$$\begin{aligned} f(x; \theta _1, \theta _2) = \exp \left\{ \langle \theta _2, -x{}^tx \rangle _{F} + \langle \theta _1, x \rangle - \frac{1}{4} {}^t\theta _1 \theta _2^{-1} \theta _1 - \frac{d}{2} \log (\pi ) + \frac{1}{2} \log |\theta _2| \right\} \nonumber \\ \end{aligned}$$
(75)

with the log normalizer F:

$$\begin{aligned} F(\theta _1, \theta _2) = \frac{1}{4} {}^t\theta _1 \theta _2^{-1} \theta _1 + \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |\theta _2| \end{aligned}$$
(76)
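
As in the univariate case, a short numerical check (ours) confirms that the canonical form of Eq. (75), with F as in Eq. (76), recovers the usual multivariate normal density; SciPy's multivariate_normal serves as the reference.

```python
import numpy as np
from scipy.stats import multivariate_normal

def to_natural(mu, Sigma):
    # Eqs. (71)-(72): theta_1 = Sigma^{-1} mu, theta_2 = (1/2) Sigma^{-1}
    P = np.linalg.inv(Sigma)
    return P @ mu, 0.5 * P

def log_normalizer(theta1, theta2):
    # F(theta_1, theta_2), Eq. (76)
    d = theta1.shape[0]
    return (0.25 * theta1 @ np.linalg.inv(theta2) @ theta1
            + 0.5 * d * np.log(np.pi)
            - 0.5 * np.linalg.slogdet(theta2)[1])

def pdf_canonical(x, mu, Sigma):
    # exp{ <theta_2, -x x^T>_F + <theta_1, x> - F(theta) }, Eq. (75)
    t1, t2 = to_natural(mu, Sigma)
    return np.exp(-np.sum(t2 * np.outer(x, x)) + t1 @ x - log_normalizer(t1, t2))

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d))
mu, Sigma = rng.normal(size=d), A @ A.T + d * np.eye(d)
x = rng.normal(size=d)
print(np.allclose(pdf_canonical(x, mu, Sigma),
                  multivariate_normal(mean=mu, cov=Sigma).pdf(x)))  # expected: True
```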

1.3 Gradient of the Log-Normalizer

By applying the following formulas from the matrix cookbook (Petersen and Pedersen 2012)

identity 57:
$$ \frac{\partial \log |X|}{\partial X} = ({}^tX)^{-1} = {}^t(X^{-1}) $$
identity 61:
$$\frac{\partial {}^ta X^{-1} b}{\partial X} = - {}^tX^{-1} a \, {}^tb \, {}^tX^{-1} $$
identity 81:
$$\frac{\partial {}^tx B x}{\partial x} = (B + {}^tB)x $$

the gradient of the log-normalizer is given by:

$$\begin{aligned} \frac{\partial F}{\partial \theta _1}(\theta _1,\theta _2)&= \frac{1}{4} (\theta _2^{-1}+ {}^{t}\theta _2^{-1}) \theta _1 = \frac{1}{2} \theta _2^{-1} \theta _1 \end{aligned}$$
(77)
$$\begin{aligned} \frac{\partial F}{\partial \theta _2}(\theta _1,\theta _2)&= - \frac{1}{4} {}^t\theta _2^{-1} \theta _1 {}^t\theta _1 \theta _2^{-1} - \frac{1}{2} {}^t\theta _2^{-1} = - \left( \frac{1}{2} \theta _2^{-1} \theta _1\right) ^t\left( \frac{1}{2} \theta _2^{-1} \theta _1\right) - \frac{1}{2} \theta _2^{-1} \end{aligned}$$
(78)

To emphasize the coherence of these formulas, recall that the gradient of the log-normalizer corresponds to the expectation of the sufficient statistics:

$$\begin{aligned} \mathbb {E}[x]&= \mu\equiv & {} ~\frac{1}{2}\theta _2^{-1} \theta _1\end{aligned}$$
(79)
$$\begin{aligned} \mathbb {E}[-x{}^tx]&= -\mathbb {E}[x{}^tx] = -\mu {}^{t}\mu - \varSigma\equiv & {} - \left( \frac{1}{2}\theta _2^{-1} \theta _1\right) ^t\left( \frac{1}{2}\theta _2^{-1} \theta _1\right) - \frac{1}{2}\theta _2^{-1} \end{aligned}$$
(80)

The last equation comes from the expansion of \(\mathbb {E}[(x - \mu ) {}^t(x - \mu )]\).
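
The matrix calculus above can also be checked with central finite differences of F. The sketch below (our addition) compares a numerical gradient, taken entry-wise over \(\theta _1\) and \(\theta _2\), with the closed forms of Eqs. (77)–(78).

```python
import numpy as np

def F(theta1, theta2):
    # Log-normalizer of Eq. (76); theta2 is treated as a full d x d matrix here
    d = theta1.shape[0]
    return (0.25 * theta1 @ np.linalg.inv(theta2) @ theta1
            + 0.5 * d * np.log(np.pi) - 0.5 * np.linalg.slogdet(theta2)[1])

def grad_F(theta1, theta2):
    # Closed forms of Eqs. (77)-(78)
    eta1 = 0.5 * np.linalg.solve(theta2, theta1)
    return eta1, -np.outer(eta1, eta1) - 0.5 * np.linalg.inv(theta2)

rng = np.random.default_rng(1)
d = 2
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)
theta1, theta2 = np.linalg.solve(Sigma, mu), 0.5 * np.linalg.inv(Sigma)  # Eqs. (71)-(72)

eps = 1e-6  # central finite differences, entry by entry
g1 = np.array([(F(theta1 + eps * e, theta2) - F(theta1 - eps * e, theta2)) / (2 * eps)
               for e in np.eye(d)])
g2 = np.array([[(F(theta1, theta2 + eps * np.outer(ei, ej))
                 - F(theta1, theta2 - eps * np.outer(ei, ej))) / (2 * eps)
                for ej in np.eye(d)] for ei in np.eye(d)])

e1, e2 = grad_F(theta1, theta2)
print(np.allclose(g1, e1, atol=1e-5), np.allclose(g2, e2, atol=1e-5))  # True True
```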

1.4 Convex Conjugate G of F and Its Gradient

In order to get the dual coordinate system \(H=(\eta _1, \eta _2)\), the following set of equations has to be inverted:

$$\begin{aligned} \eta _1&=\frac{1}{2} \theta _2^{-1} \theta _1 \end{aligned}$$
(81)
$$\begin{aligned} \eta _2&= -\left( \frac{1}{2}\theta _2^{-1} \theta _1\right) ^t\left( \frac{1}{2}\theta _2^{-1} \theta _1\right) - \frac{1}{2}\theta _2^{-1} \end{aligned}$$
(82)

By plugging the first equation into the second one, it follows

$$\begin{aligned} \eta _2 = - \eta _1 {}^t\eta _1 - \frac{1}{2}\theta _2^{-1} \Longleftrightarrow \theta _2= \frac{1}{2}(-\eta _1 {}^t\eta _1 -\eta _2)^{-1} = \frac{\partial G}{\partial \eta _2}(\eta _1,\eta _2) \end{aligned}$$
(83)

and

$$\begin{aligned} \theta _1 = 2 \theta _2\eta _1= (- \eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 = \frac{\partial G}{\partial \eta _1}(\eta _1,\eta _2) \end{aligned}$$
(84)

Formulas are even simpler regarding the source parameters since we know from Eqs. 79 and 80 that

$$\begin{aligned} \eta _1 = \mu\Longleftrightarrow & {} \mu = \eta _1 \end{aligned}$$
(85)
$$\begin{aligned} \eta _2= -\mu {}^{t}\mu - \varSigma\Longleftrightarrow & {} \varSigma = - \eta _1 {}^t\eta _1 - \eta _2 \end{aligned}$$
(86)

In order to compute \(G := F^{*}\), we simply have to reuse our previous results in

$$G(H) = \langle (\nabla F)^{-1} (H), H \rangle - F ( (\nabla F)^{-1} (H))$$

and obtain the following expression

$$\begin{aligned} G(\eta _1, \eta _2)&= \langle (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1, \eta _1 \rangle + \langle \frac{1}{2} (- \eta _1 {}^t\eta _1 - \eta _2)^{-1}, \eta _2 \rangle _{F}\\&- \frac{1}{4} {}^t((-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1) 2(-\eta _1 {}^t\eta _1 - \eta _2) (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 \\&- \frac{d}{2} \log (\pi ) + \frac{1}{2} \log |\frac{1}{2} (-\eta _1 {}^t\eta _1 - \eta _2)^{-1}|\\&= {}^t \eta _1 (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 +\frac{1}{2} tr({}^{t}(-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _2)\\&- \frac{1}{2} {}^t\eta _1{}^t(-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1\\&- \frac{d}{2} \log (\pi ) + \frac{1}{2} \log |(2(-\eta _1 {}^t\eta _1 - \eta _2))^{-1}|\\&= \frac{1}{2} {}^t \eta _1 (- \eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 +\frac{1}{2} tr((-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _2)\\&- \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= \frac{1}{2} \left( tr((- \eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 {}^t \eta _1 ) +tr((-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _2)\right) \\&- \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= - \frac{1}{2} tr((- \eta _1 {}^t\eta _1 - \eta _2)^{-1} (- \eta _1 {}^t \eta _1 - \eta _2)) - \frac{d}{2} \log (\pi )\nonumber \\&- \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= - \frac{1}{2} tr(I_{d}) - \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= - \frac{d}{2} \log (e\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\ \end{aligned}$$

Let us rewrite this expression with source parameters:

$$\begin{aligned} G(\mu , \varSigma ) = - \frac{d}{2} \log (e\pi ) - \frac{1}{2} \log |2\varSigma |\ \end{aligned}$$
(87)
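
Equation (87) can also be checked (our addition) through the Legendre transform itself: evaluating \(\langle \varTheta , H \rangle - F(\varTheta )\) at the dual pair given by Eqs. (71)–(72) and (85)–(86) should return \(G(\mu , \varSigma )\).

```python
import numpy as np

def F(theta1, theta2):
    # Eq. (76)
    d = theta1.shape[0]
    return (0.25 * theta1 @ np.linalg.inv(theta2) @ theta1
            + 0.5 * d * np.log(np.pi) - 0.5 * np.linalg.slogdet(theta2)[1])

def G_closed_form(mu, Sigma):
    # Eq. (87): G(mu, Sigma) = -(d/2) log(e pi) - (1/2) log|2 Sigma|
    d = mu.shape[0]
    return -0.5 * d * np.log(np.e * np.pi) - 0.5 * np.linalg.slogdet(2.0 * Sigma)[1]

rng = np.random.default_rng(2)
d = 3
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)

theta1, theta2 = np.linalg.solve(Sigma, mu), 0.5 * np.linalg.inv(Sigma)  # Eqs. (71)-(72)
eta1, eta2 = mu, -np.outer(mu, mu) - Sigma                               # Eqs. (85)-(86)

# Legendre transform at the dual pair: G(H) = <Theta, H> - F(Theta),
# where <Theta, H> = <theta_1, eta_1> + <theta_2, eta_2>_F
fenchel = theta1 @ eta1 + np.sum(theta2 * eta2) - F(theta1, theta2)
print(np.allclose(fenchel, G_closed_form(mu, Sigma)))  # expected: True
```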

1.5 Kullback–Leibler Divergence

First recall that the Kullback–Leibler divergence between two probability density functions p and q is

$$ KL(p || q) = \int p(x) \log \frac{p(x)}{q(x)} dx$$

For two multivariate normal distributions, it is known in closed form

$$\begin{aligned} KL(\mathcal {N}(\mu _{p},\varSigma _{p}) || \mathcal {N}(\mu _{q},\varSigma _{q})) = \frac{1}{2}\left( \log \left( \frac{|\varSigma _{q}|}{|\varSigma _{p}|}\right) + tr(\varSigma _{q}^{-1}\varSigma _{p}) + {}^{t}(\mu _{q}-\mu _{p})\varSigma _{q}^{-1}(\mu _{q}-\mu _{p}) - d\right) \end{aligned}$$
(88)

Since the multivariate normal distribution is an exponential family, the same result must be obtained using the Bregman divergence for G on the expectation parameters \(H_{p}\) and \(H_{q}\):

$$KL(\mathcal {N}(\mu _{p},\varSigma _{p}) || \mathcal {N}(\mu _{q},\varSigma _{q})) = B_G(H_p || H_q) = G(H_{p}) - G(H_{q}) - \langle H_{p} - H_{q}, \nabla G (H_{q}) \rangle $$
$$\begin{aligned} G(H_{p}) - G(H_{q})&= - \frac{d}{2} \log (e\pi ) - \frac{1}{2} \log |-2(\eta _{1_{p}} {}^t\eta _{1_{p}} + \eta _{2_{p}})| \\&+ \frac{d}{2} \log (e\pi ) + \frac{1}{2} \log |-2(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})| \\&= \frac{1}{2} \log \frac{|-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})|}{|-(\eta _{1_{p}} {}^t\eta _{1_{p}} + \eta _{2_{p}})|}\\ - \langle H_{p} - H_{q}, \nabla G (H_{q}) \rangle&= - \langle \eta _{1_{p}} - \eta _{1_{q}}, - (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}} \rangle \\&- tr\left( ^{t} (\eta _{2_{p}} - \eta _{2_{q}}) \left( -\frac{1}{2} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}\right) \right) \\&= {}^t \eta _{1_{p}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}} - {}^t \eta _{1_{q}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}} \\&- \frac{1}{2} tr({}^{t} \eta _{2_{p}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1})) + \frac{1}{2} tr({}^{t}\eta _{2_{q}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}))\\ \end{aligned}$$

In order to go further, we can express these two formulas using \(\mu \) and \(\varSigma ^{-1} = (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} = -(\eta _1 {}^t\eta _1 + \eta _2)^{-1} \) (cf. Eq. 86):

$$\begin{aligned} \frac{1}{2} \log \frac{|-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})|}{|-(\eta _{1_{p}} {}^t\eta _{1_{p}} + \eta _{2_{p}})|}&= \frac{1}{2} \log \frac{|\varSigma _q|}{|\varSigma _p|} \end{aligned}$$
$$\begin{aligned} {}^t \eta _{1_{p}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}}&= -{}^t \mu _{p} \varSigma _{q}^{-1} \mu _{q}\\ - {}^t \eta _{1_{q}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}}&= {}^t \mu _{q} \varSigma _{q}^{-1} \mu _{q} \end{aligned}$$
$$\begin{aligned} - \frac{1}{2} tr({}^{t} \eta _{2_{p}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}))&= \frac{1}{2} tr((\mu _{p}{}^{t}\mu _{p} + \varSigma _{p}) \varSigma _{q}^{-1})\\&= \frac{1}{2} tr(\mu _{p}{}^{t}\mu _{p}\varSigma _{q}^{-1}) + \frac{1}{2} tr(\varSigma _{p}\varSigma _{q}^{-1})\\&= \frac{1}{2} {}^{t}\mu _{p}\varSigma _{q}^{-1}\mu _{p} + \frac{1}{2} tr(\varSigma _{q}^{-1}\varSigma _{p})\\ + \frac{1}{2} tr({}^{t}\eta _{2_{q}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}))&= - \frac{1}{2} tr((\mu _{q}{}^{t}\mu _{q} + \varSigma _{q}) \varSigma _{q}^{-1})\\&= - \frac{1}{2} tr(\mu _{q}{}^{t}\mu _{q}\varSigma _{q}^{-1}) - \frac{1}{2} tr(\varSigma _{q}\varSigma _{q}^{-1})\\&= - \frac{1}{2} {}^{t}\mu _{q}\varSigma _{q}^{-1}\mu _{q} - \frac{1}{2} d \end{aligned}$$

By summing up these terms, the standard formula for the KL divergence is recovered:

$$\begin{aligned} KL(\mathcal {N}(\mu _{p},\varSigma _{p})&|| \mathcal {N}(\mu _{q},\varSigma _{q})) = \frac{1}{2} \log \frac{|\varSigma _q|}{|\varSigma _p|} -{}^t \mu _{p} \varSigma _{q}^{-1} \mu _{q} +{}^t \mu _{q} \varSigma _{q}^{-1} \mu _{q} + \\&\frac{1}{2} {}^{t}\mu _{p}\varSigma _{q}^{-1}\mu _{p} + \frac{1}{2} tr(\varSigma _{q}^{-1}\varSigma _{p}) - \frac{1}{2} {}^{t}\mu _{q}\varSigma _{q}^{-1}\mu _{q} - \frac{1}{2} d\\ =&\frac{1}{2} \left( \log \frac{|\varSigma _q|}{|\varSigma _p|} + tr(\varSigma _{q}^{-1}\varSigma _{p}) - d~-\right. \\&\left. \left\{ 2{}^t \mu _{p} \varSigma _{q}^{-1} \mu _{q} - 2 {}^t \mu _{q} \varSigma _{q}^{-1} \mu _{q} - ^{t}\mu _{p}\varSigma _{q}^{-1}\mu _{p} + {}^{t}\mu _{q}\varSigma _{q}^{-1}\mu _{q}\right\} \right) \\ =&\frac{1}{2}\left( \log \frac{|\varSigma _q|}{|\varSigma _p|} + tr(\varSigma _{q}^{-1}\varSigma _{p}) + {}^t (\mu _{p} - \mu _{q}) \varSigma _{q}^{-1} (\mu _{p} - \mu _{q}) - d \right) \end{aligned}$$
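
Finally, the identity \(KL = B_G(H_p : H_q)\) and the closed form of Eq. (88) can be compared numerically. This sketch (ours) builds the expectation parameters from Eqs. (85)–(86) and uses \(\nabla G\) from Eqs. (83)–(84).

```python
import numpy as np

def G(eta1, eta2):
    # Cf. Eq. (87) with Sigma = -eta_1 eta_1^T - eta_2
    d = eta1.shape[0]
    Sigma = -np.outer(eta1, eta1) - eta2
    return -0.5 * d * np.log(np.e * np.pi) - 0.5 * np.linalg.slogdet(2.0 * Sigma)[1]

def grad_G(eta1, eta2):
    # Eqs. (83)-(84): grad G(H) = Theta
    S = -np.outer(eta1, eta1) - eta2              # = Sigma
    return np.linalg.solve(S, eta1), 0.5 * np.linalg.inv(S)

def bregman_G(Hp, Hq):
    # B_G(H_p : H_q) = G(H_p) - G(H_q) - <H_p - H_q, grad G(H_q)>
    (p1, p2), (q1, q2) = Hp, Hq
    g1, g2 = grad_G(q1, q2)
    return G(p1, p2) - G(q1, q2) - (p1 - q1) @ g1 - np.sum((p2 - q2) * g2)

def kl_closed_form(mu_p, Sp, mu_q, Sq):
    # Eq. (88)
    d = mu_p.shape[0]
    Sq_inv = np.linalg.inv(Sq)
    diff = mu_q - mu_p
    return 0.5 * (np.linalg.slogdet(Sq)[1] - np.linalg.slogdet(Sp)[1]
                  + np.trace(Sq_inv @ Sp) + diff @ Sq_inv @ diff - d)

rng = np.random.default_rng(3)
d = 3
def random_gaussian():
    A = rng.normal(size=(d, d))
    return rng.normal(size=d), A @ A.T + d * np.eye(d)

(mu_p, Sp), (mu_q, Sq) = random_gaussian(), random_gaussian()
Hp = (mu_p, -np.outer(mu_p, mu_p) - Sp)           # Eqs. (85)-(86)
Hq = (mu_q, -np.outer(mu_q, mu_q) - Sq)
print(np.allclose(bregman_G(Hp, Hq), kl_closed_form(mu_p, Sp, mu_q, Sq)))  # expected: True
```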

Copyright information

© 2017 Springer International Publishing AG

Cite this chapter

Saint-Jean, C., Nielsen, F. (2017). Batch and Online Mixture Learning: A Review with Extensions. In: Nielsen, F., Critchley, F., Dodson, C. (eds) Computational Information Geometry. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-47058-0_11

  • DOI: https://doi.org/10.1007/978-3-319-47058-0_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47056-6

  • Online ISBN: 978-3-319-47058-0