Batch and Online Mixture Learning: A Review with Extensions

Computational Information Geometry

Part of the book series: Signals and Communication Technology (SCT)

Abstract

This paper addresses the problem of online learning of finite statistical mixtures of regular exponential families. We begin with a concise review of gradient-based and stochastic gradient-based optimization methods and their generalizations. We then focus on two stochastic versions of the celebrated Expectation-Maximization (EM) algorithm: Titterington’s second-order stochastic gradient EM and Cappé and Moulines’ online EM. Depending on which step of EM is approximated, the constraints on the mixture parameters may be violated. A justification of these approaches is given, along with ready-to-use formulas for mixtures of regular exponential families. Finally, to illustrate our study, we report experimental comparisons on univariate normal mixtures.

Notes

  1. It is equivalent to an exponentially decaying moving average of past gradients.

  2. When \((\nabla F)^{-1}\) is computed with numerical approximations, this may give a different result.

References

  • Amari, S. (1997). Neural learning in structured parameter spaces — Natural Riemannian gradient. Advances in Neural Information Processing Systems (NIPS), 9, 127–133.

  • Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.

  • Amari, S. (2016). Information geometry and its applications. Applied Mathematical Sciences. Japan: Springer.

  • Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705–1749.

  • Bogdan, K., & Bogdan, M. (2000). On existence of maximum likelihood estimators in exponential families. Statistics, 34(2), 137–149.

  • Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning and neural networks. Cambridge: Cambridge University Press.

  • Bottou, L., & Bousquet, O. (2011). The tradeoffs of large scale learning. In S. Sra, S. Nowozin, & S. J. Wright (Eds.), Optimization for machine learning (pp. 351–368). Cambridge: MIT Press.

  • Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

  • Cappé, O., & Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society. Series B (Methodological), 71(3), 593–613.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39, 1–38.

  • Liu, Q., & Ihler, A. T. (2014). Distributed estimation, information loss and exponential families. Advances in Neural Information Processing Systems, 27, 1098–1106.

  • Miura, K. (2011). An introduction to maximum likelihood estimation in information geometry. Interdisciplinary Information Sciences, 17(3), 155–174.

  • Neal, R. M., & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Cambridge: MIT Press.

  • Nielsen, F., & Garcia, V. (2009). Statistical exponential families: A digest with flash cards. arXiv:0911.4863.

  • Petersen, K. B., & Pedersen, M. S. (2012). The matrix cookbook. http://www2.imm.dtu.dk/pubdb/p.php?3274.

  • Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838–855.

  • Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.

  • Saint-Jean, C., & Nielsen, F. (2014). Hartigan’s method for \(k\)-MLE: Mixture modeling with Wishart distributions and its application to motion retrieval. Geometric theory of information (pp. 301–330). New York: Springer.

  • Sculley, D. (2010). Web-scale \(k\)-means clustering. In Proceedings of the 19th International Conference on World Wide Web (pp. 1177–1178).

  • Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.

  • Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society. Series B (Methodological), 46(2), 257–267.


Author information

Correspondence to Christophe Saint-Jean.

Appendices

Univariate Gaussian Distribution as an Exponential Family

Canonical Decomposition and \({\varvec{F}}\)

$$\begin{aligned} f(x;\mu ,\sigma ^{2})&= \frac{1}{(2 \pi \sigma ^{2})^{1/2}}\exp \left\{ -\frac{(x - \mu )^{2}}{2\sigma ^{2}} \right\} \\&= \exp \left\{ -\frac{1}{2\sigma ^{2}} (x^{2} - 2 x \mu + \mu ^{2}) - \frac{1}{2} \log \left( 2 \pi \sigma ^{2}\right) \right\} \\&= \exp \left\{ \langle \frac{1}{2\sigma ^{2}}, -x^{2} \rangle + \langle \frac{\mu }{\sigma ^{2}}, x \rangle - \frac{\mu ^{2}}{2\sigma ^{2}} - \frac{1}{2} \log \left( 2 \pi \sigma ^{2}\right) \right\} \\ \end{aligned}$$

In the sequel, the vector of source parameters is denoted \(\lambda =(\mu , \sigma ^2)\). One may recognize the canonical form of an exponential family

$$f(x;\theta ) = \exp \left\{ <\theta ,s(x)> +\, k(x) - F(\theta )\right\} $$

by setting \(\theta = (\theta _1,\theta _2)\) with

$$\begin{aligned} \theta _{1}&= \frac{\mu }{\sigma ^{2}} \Longleftrightarrow \mu = \frac{\theta _{1}}{2\theta _{2}}\end{aligned}$$
(52)
$$\begin{aligned} \theta _{2}&= \frac{1}{2\sigma ^{2}} \Longleftrightarrow \sigma ^{2} = \frac{1}{2\theta _{2}} \end{aligned}$$
(53)
$$\begin{aligned} s(x)&=(x,-x^{2}) \end{aligned}$$
(54)
$$\begin{aligned} k(x)&= 0 \end{aligned}$$
(55)
$$\begin{aligned} f(x; \theta _{1}, \theta _{2})&= \exp \left\{ \langle \theta _{2}, -x^{2} \rangle + \langle \theta _{1}, x \rangle - \frac{1}{2} \frac{(\theta _{1}/2\theta _{2})^{2}}{1/2\theta _{2}} - \frac{1}{2} \log (2\pi /2\theta _{2})\right\} \\&= \exp \left\{ \langle \theta _{2}, -x^{2} \rangle + \langle \theta _{1}, x \rangle - \frac{\theta _{1}^{2}}{4\theta _{2}} - \frac{1}{2} \log (\pi ) + \frac{1}{2} \log \theta _{2}\right\} \end{aligned}$$

with the log normalizer F as

$$\begin{aligned} F(\theta _{1}, \theta _{2}) = \frac{\theta _{1}^{2}}{4\theta _{2}} + \frac{1}{2} \log (\pi ) - \frac{1}{2} \log \theta _{2} \end{aligned}$$
(56)
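
As a quick numerical sanity check (our addition, not part of the original text), the following Python sketch evaluates the canonical form \(\exp \left\{ \langle \theta , s(x) \rangle - F(\theta )\right\} \) with the parameters of Eqs. (52)–(56) and compares it to the usual normal density; the helper functions to_natural, log_normalizer and pdf_canonical are ours.

```python
import numpy as np
from scipy.stats import norm

def to_natural(mu, sigma2):
    # (mu, sigma^2) -> (theta_1, theta_2), Eqs. (52)-(53)
    return mu / sigma2, 1.0 / (2.0 * sigma2)

def log_normalizer(theta1, theta2):
    # F(theta_1, theta_2), Eq. (56)
    return theta1**2 / (4.0 * theta2) + 0.5 * np.log(np.pi) - 0.5 * np.log(theta2)

def pdf_canonical(x, mu, sigma2):
    # exp{ <theta_1, x> + <theta_2, -x^2> - F(theta) } with s(x) = (x, -x^2), k(x) = 0
    t1, t2 = to_natural(mu, sigma2)
    return np.exp(t1 * x + t2 * (-x**2) - log_normalizer(t1, t2))

x = np.linspace(-3.0, 5.0, 9)
mu, sigma2 = 1.2, 0.7
print(np.allclose(pdf_canonical(x, mu, sigma2),
                  norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))))  # expected: True
```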

1.1 Gradient of the Log-Normalizer

The gradient of the log-normalizer is given by:

$$\begin{aligned} \frac{\partial F}{\partial \theta _{1}}(\theta _{1}, \theta _{2})&= \frac{\theta _{1}}{2\theta _{2}} \end{aligned}$$
(57)
$$\begin{aligned} \frac{\partial F}{\partial \theta _{2}}(\theta _{1},\theta _{2})&= -\frac{\theta _{1}^{2}}{4\theta _{2}^{2}} - \frac{1}{2\theta _{2}} \end{aligned}$$
(58)

In order to get the dual coordinate system \(\eta =(\eta _{1}, \eta _{2})\), the following set of equations has to be inverted:

$$\begin{aligned} \eta _{1}&= \frac{\theta _{1}}{2\theta _{2}} \end{aligned}$$
(59)
$$\begin{aligned} \eta _{2}&= -\frac{\theta _{1}^{2}}{4\theta _{2}^{2}} - \frac{1}{2\theta _{2}} \end{aligned}$$
(60)

By plugging the first equation into the second one, it follows:

$$\begin{aligned} \eta _{2} = - \eta _{1}^{2} - \frac{1}{2\theta _{2}} \Longleftrightarrow&\theta _{2} = -\frac{1}{2(\eta _{1}^{2} + \eta _{2})}&= \frac{\partial F^*}{\partial \eta _{2}}(\eta _{1},\eta _{2}) \end{aligned}$$
(61)
$$\begin{aligned}&\theta _{1} = 2 \theta _{2} \eta _{1} = - \frac{\eta _{1}}{(\eta _{1}^{2} + \eta _{2})}&= \frac{\partial F^*}{\partial \eta _{1}}(\eta _{1},\eta _{2}) \end{aligned}$$
(62)

Formulas are even simpler regarding the source parameters since we know that

$$\begin{aligned} \eta _{1} = \mathbb {E}[X] = \mu\Longleftrightarrow & {} \mu = \eta _{1} \end{aligned}$$
(63)
$$\begin{aligned} \eta _{2} = \mathbb {E}[-X^2] = -\left\{ \mu ^2 + \sigma ^2\right\}\Longleftrightarrow & {} \sigma ^2 = - \left\{ \eta _{1}^2 + \eta _{2}\right\} \end{aligned}$$
(64)
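
These conversions are easy to check numerically. The short sketch below (ours, using only NumPy) implements \(\nabla F\) from Eqs. (57)–(58) and \(\nabla F^{*}\) from Eqs. (61)–(62), and verifies that they are mutually inverse and agree with Eqs. (63)–(64).

```python
import numpy as np

def grad_F(theta1, theta2):
    # (eta_1, eta_2) = grad F(theta), Eqs. (57)-(58)
    return theta1 / (2.0 * theta2), -theta1**2 / (4.0 * theta2**2) - 1.0 / (2.0 * theta2)

def grad_F_star(eta1, eta2):
    # (theta_1, theta_2) = grad F*(eta), Eqs. (61)-(62)
    return -eta1 / (eta1**2 + eta2), -1.0 / (2.0 * (eta1**2 + eta2))

mu, sigma2 = -0.4, 2.5
theta = (mu / sigma2, 1.0 / (2.0 * sigma2))       # Eqs. (52)-(53)
eta = grad_F(*theta)

print(np.allclose(eta, (mu, -(mu**2 + sigma2))))  # Eqs. (63)-(64): True
print(np.allclose(grad_F_star(*eta), theta))      # the two gradient maps are inverse: True
```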

In order to compute \(F^{*}\), we simply have to reuse our previous results in

$$F^{*}(H) = \langle (\nabla F)^{-1} (H), H \rangle - F ( (\nabla F)^{-1} (H))$$

and obtain the following expression:

$$\begin{aligned} F^{*}(\eta _{1}, \eta _{2}) = - \frac{1}{2} \log (e\pi ) - \frac{1}{2} \log \left( -2(\eta _{1}^{2} + \eta _{2})\right) = - \frac{1}{2} \log (e\pi ) - \frac{1}{2} \log \left( 2\sigma ^{2}\right) \end{aligned}$$

The Hessians \(H(F)\) and \(H(F^*)\) of F and \(F^*\), respectively, are

$$\begin{aligned} H(F)(\theta _1, \theta _2) = \begin{pmatrix} \frac{1}{2 \theta _2} &{} -\frac{\theta _1}{2 \theta _2^2}\\ -\frac{\theta _1}{2 \theta _2^2} &{} \frac{\theta _1^2 + \theta _2}{2 \theta _2^3} \end{pmatrix} \end{aligned}$$
(65)
$$\begin{aligned} H(F^*)(\eta _1, \eta _2) = \begin{pmatrix} \frac{\eta _1^2 - \eta _2}{(\eta _1^2 + \eta _2)^2} &{} \frac{\eta _1}{(\eta _1^2 + \eta _2)^2}\\ \frac{\eta _1}{(\eta _1^2 + \eta _2)^2} &{} \frac{1}{2(\eta _1^2 + \eta _2)^2} \end{pmatrix} \end{aligned}$$
(66)

Since the univariate normal distribution is an exponential family, the Kullback–Leibler divergence is a Bregman divergence for \(F^*\) on expectation parameters:

$$\begin{aligned} KL(\mathcal {N}(\mu _{p},\sigma ^2_{p}) || \mathcal {N}(\mu _{q},\sigma ^2_{q}))&= B_{F^*}(\eta _p : \eta _q) \\&= F^*(\eta _p) - F^*(\eta _q) - \langle \eta _p - \eta _q, \nabla F^* (\eta _q) \rangle \end{aligned}$$

After calculations, it follows:

$$\begin{aligned} B_{F^*}(\eta _p : \eta _q) = \frac{1}{2} \left( \log \left( \frac{\eta _{1_q}^{2} + \eta _{2_q}}{\eta _{1_p}^{2} + \eta _{2_p}}\right) + \frac{2(\eta _{1_p} - \eta _{1_q})\eta _{1_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} + \frac{\eta _{2_p} - \eta _{2_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} \right) \end{aligned}$$
(67)

A simple rewrite of it with the source parameters leads to the known closed form:

$$\begin{aligned} \frac{1}{2} \left( \log \left( \frac{\eta _{1_q}^{2} + \eta _{2_q}}{\eta _{1_p}^{2} + \eta _{2_p}}\right) + \frac{2(\eta _{1_p} - \eta _{1_q})\eta _{1_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} + \frac{\eta _{2_p} - \eta _{2_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} \right)&= \nonumber \\ \frac{1}{2} \left( \log \left( \frac{\eta _{1_q}^{2} + \eta _{2_q}}{\eta _{1_p}^{2} + \eta _{2_p}}\right) + \frac{(\eta _{1_p}^2 + \eta _{2_p}) - (\eta _{1_p}-\eta _{1_q})^2 - (\eta _{1_q}^2 + \eta _{2_q})}{(\eta _{1_q}^{2} + \eta _{2_q})} \right)&= \nonumber \\ \frac{1}{2} \left( \log \left( \frac{\sigma _q^{2}}{\sigma _p^{2}}\right) + \frac{\sigma _p^{2}}{\sigma _q^{2}} + \frac{(\mu _p-\mu _q)^2}{\sigma _q^{2}} -1 \right) \end{aligned}$$
(68)
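
As an illustration (our addition), the Bregman divergence \(B_{F^*}\) can be compared numerically against the closed form of Eq. (68); the function F_star below uses the expression of \(F^*\) given above, i.e. the univariate case of Eq. (87).

```python
import numpy as np

def F_star(eta1, eta2):
    # F*(eta) = -(1/2) log(e pi) - (1/2) log(-2 (eta_1^2 + eta_2)), with sigma^2 = -(eta_1^2 + eta_2)
    return -0.5 * np.log(np.e * np.pi) - 0.5 * np.log(-2.0 * (eta1**2 + eta2))

def grad_F_star(eta1, eta2):
    # Eqs. (61)-(62)
    return np.array([-eta1 / (eta1**2 + eta2), -1.0 / (2.0 * (eta1**2 + eta2))])

def bregman_F_star(eta_p, eta_q):
    # B_{F*}(eta_p : eta_q) = F*(eta_p) - F*(eta_q) - <eta_p - eta_q, grad F*(eta_q)>
    return (F_star(*eta_p) - F_star(*eta_q)
            - np.dot(np.asarray(eta_p) - np.asarray(eta_q), grad_F_star(*eta_q)))

def kl_closed_form(mu_p, s2_p, mu_q, s2_q):
    # Right-hand side of Eq. (68)
    return 0.5 * (np.log(s2_q / s2_p) + s2_p / s2_q + (mu_p - mu_q)**2 / s2_q - 1.0)

mu_p, s2_p, mu_q, s2_q = 0.3, 1.5, -1.0, 0.8
eta_p = (mu_p, -(mu_p**2 + s2_p))      # Eqs. (63)-(64)
eta_q = (mu_q, -(mu_q**2 + s2_q))
print(np.allclose(bregman_F_star(eta_p, eta_q),
                  kl_closed_form(mu_p, s2_p, mu_q, s2_q)))  # expected: True
```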

The Fisher information matrix \(I(\lambda )\) is obtained as the expectation of the outer product of the Fisher score with itself:

$$\begin{aligned} I(\lambda )&\mathop {=}\limits ^{def} \mathbb {E}\left[ \nabla _\lambda \log f(x;\lambda ) . \nabla _\lambda \log f(x;\lambda )^T\right] \nonumber \\&= \mathbb {E}\left[ \begin{pmatrix} \frac{x-\mu }{\sigma ^2}\\ \frac{(x-\mu )^2 - \sigma ^2}{2\sigma ^4}\end{pmatrix}. \begin{pmatrix} \frac{x-\mu }{\sigma ^2} &{} \frac{(x-\mu )^2 - \sigma ^2}{2\sigma ^4}\end{pmatrix}\right] \nonumber \\&=\begin{pmatrix} \frac{1}{\sigma ^2} &{} 0 \\ 0 &{} \frac{1}{2\sigma ^4}\end{pmatrix}. \end{aligned}$$
(69)

By change in coordinates or direct computation, the Fisher information matrix is also:

$$\begin{aligned} I(\theta ) = H(F)(\theta ) = \begin{pmatrix}\frac{1}{2\theta _2} &{} -\frac{\theta _1}{2\theta _2^2}\\ -\frac{\theta _1}{2\theta _2^2} &{} \frac{\theta _1^2 + \theta _2}{2\theta _2^3}\end{pmatrix} \text{ and } I(\eta ) = \frac{1}{(\eta _1^2 + \eta _2)^2} \begin{pmatrix} (\eta _1^2 - \eta _2) &{} \eta _1\\ \eta _1 &{} \frac{1}{2}\end{pmatrix} \end{aligned}$$
(70)
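
The "change in coordinates" mentioned above is the usual transformation rule for the Fisher information: with the Jacobian \(J = \partial \theta / \partial \lambda \), one has \(I(\lambda ) = {}^tJ \, I(\theta ) \, J\). The sketch below (our addition) checks this relation between Eqs. (65) and (69).

```python
import numpy as np

def fisher_lambda(mu, sigma2):
    # Eq. (69): Fisher information in source coordinates lambda = (mu, sigma^2)
    return np.diag([1.0 / sigma2, 1.0 / (2.0 * sigma2**2)])

def hessian_F(theta1, theta2):
    # Eq. (65): I(theta) = H(F)(theta)
    return np.array([[1.0 / (2.0 * theta2), -theta1 / (2.0 * theta2**2)],
                     [-theta1 / (2.0 * theta2**2), (theta1**2 + theta2) / (2.0 * theta2**3)]])

mu, sigma2 = 0.7, 1.3
theta1, theta2 = mu / sigma2, 1.0 / (2.0 * sigma2)

# Jacobian of theta(lambda): rows are (theta_1, theta_2), columns are (mu, sigma^2)
J = np.array([[1.0 / sigma2, -mu / sigma2**2],
              [0.0,          -1.0 / (2.0 * sigma2**2)]])

print(np.allclose(J.T @ hessian_F(theta1, theta2) @ J,
                  fisher_lambda(mu, sigma2)))  # expected: True
```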

1.2 Multivariate Gaussian Distribution as an Exponential Family

Canonical Decomposition and \({\varvec{F}}\)

$$\begin{aligned} f(x;\mu ,\varSigma )&= \frac{1}{(2 \pi )^{d / 2} |\varSigma |^{1/2}}\exp \left\{ -\frac{ {}^t (x - \mu ) \varSigma ^{-1} (x - \mu )}{2} \right\} \\&= \exp \left\{ -\frac{ {}^tx\varSigma ^{-1}x - {}^t\mu \varSigma ^{-1}x - {}^tx\varSigma ^{-1}\mu + {}^t\mu \varSigma ^{-1}\mu }{2} - \log \left( (2 \pi )^{d / 2} |\varSigma |^{1/2}\right) \right\} \\&= \exp \left\{ -\frac{tr({}^tx\varSigma ^{-1}x) - \langle {}^t\varSigma ^{-1} \mu , x \rangle -\langle x, \varSigma ^{-1}\mu \rangle + \langle {}^t\varSigma ^{-1}\mu , \varSigma \varSigma ^{-1} \mu \rangle }{2} - \log \left( \pi ^{d / 2} |2\varSigma |^{1/2}\right) \right\} \end{aligned}$$

Due to the cyclic property of the trace and to the symmetry of \(\varSigma ^{-1}\), it follows:

$$\begin{aligned} f(x;\mu ,\varSigma )&= \exp \left\{ tr\left( ^t\left( \frac{1}{2}\varSigma ^{-1}\right) (-x{}^tx)\right) + \langle \varSigma ^{-1} \mu , x \rangle - \frac{1}{2} \langle \varSigma ^{-1}\mu , \varSigma \varSigma ^{-1} \mu \rangle - \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2\varSigma |\right\} \\&= \exp \left\{ \langle \frac{1}{2}\varSigma ^{-1}, -x{}^tx \rangle _{F} + \langle \varSigma ^{-1} \mu , x \rangle - \frac{1}{4} {}^t(\varSigma ^{-1}\mu ) 2\varSigma (\varSigma ^{-1}\mu ) - \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2\varSigma | \right\} \\ \end{aligned}$$

where \(\langle \cdot , \cdot \rangle _{F}\) is the Frobenius scalar product. One may recognize the canonical form of an exponential family

$$f(x;\varTheta ) = \exp \left\{ <\varTheta ,t(x)> + k(x) - F(\varTheta )\right\} $$

by setting:

$$\varTheta = (\theta _{1}, \theta _2)$$
$$\begin{aligned} \theta _1&= \varSigma ^{-1}\mu \Longleftrightarrow \mu = \frac{1}{2}\theta _2^{-1} \theta _1\end{aligned}$$
(71)
$$\begin{aligned} \theta _2&= \frac{1}{2}\varSigma ^{-1} \Longleftrightarrow \varSigma = \frac{1}{2}\theta _2^{-1} \end{aligned}$$
(72)
$$\begin{aligned} t(x)&=(x,-x{}^tx)\end{aligned}$$
(73)
$$\begin{aligned} k(x)&= 0 \end{aligned}$$
(74)
$$\begin{aligned} f(x; \theta _1, \theta _2) = \exp \left\{ \langle \theta _2, -x{}^tx \rangle _{F} + \langle \theta _1, x \rangle - \frac{1}{4} {}^t\theta _1 \theta _2^{-1} \theta _1 - \frac{d}{2} \log (\pi ) + \frac{1}{2} \log |\theta _2| \right\} \nonumber \\ \end{aligned}$$
(75)

with the log normalizer F:

$$\begin{aligned} F(\theta _1, \theta _2) = \frac{1}{4} {}^t\theta _1 \theta _2^{-1} \theta _1 + \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |\theta _2| \end{aligned}$$
(76)
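
As in the univariate case, a short numerical check (ours) confirms that the canonical form of Eq. (75), with F as in Eq. (76), recovers the usual multivariate normal density; SciPy's multivariate_normal serves as the reference.

```python
import numpy as np
from scipy.stats import multivariate_normal

def to_natural(mu, Sigma):
    # Eqs. (71)-(72): theta_1 = Sigma^{-1} mu, theta_2 = (1/2) Sigma^{-1}
    P = np.linalg.inv(Sigma)
    return P @ mu, 0.5 * P

def log_normalizer(theta1, theta2):
    # F(theta_1, theta_2), Eq. (76)
    d = theta1.shape[0]
    return (0.25 * theta1 @ np.linalg.inv(theta2) @ theta1
            + 0.5 * d * np.log(np.pi)
            - 0.5 * np.linalg.slogdet(theta2)[1])

def pdf_canonical(x, mu, Sigma):
    # exp{ <theta_2, -x x^T>_F + <theta_1, x> - F(theta) }, Eq. (75)
    t1, t2 = to_natural(mu, Sigma)
    return np.exp(-np.sum(t2 * np.outer(x, x)) + t1 @ x - log_normalizer(t1, t2))

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d))
mu, Sigma = rng.normal(size=d), A @ A.T + d * np.eye(d)
x = rng.normal(size=d)
print(np.allclose(pdf_canonical(x, mu, Sigma),
                  multivariate_normal(mean=mu, cov=Sigma).pdf(x)))  # expected: True
```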

1.3 Gradient of the Log-Normalizer

By applying the following formulas from the matrix cookbook (Petersen and Pedersen 2012)

identity 57:
$$ \frac{\partial \log |X|}{\partial X} = ({}^tX)^{-1} = {}^t(X^{-1}) $$
identity 61:
$$\frac{\partial {}^ta X^{-1} b}{\partial X} = - {}^tX^{-1} a \, {}^tb \, {}^tX^{-1} $$
identity 81:
$$\frac{\partial {}^tx B x}{\partial x} = (B + {}^tB)x $$

the gradient of the log-normalizer is given by:

$$\begin{aligned} \frac{\partial F}{\partial \theta _1}(\theta _1,\theta _2)&= \frac{1}{4} (\theta _2^{-1}+ {}^{t}\theta _2^{-1}) \theta _1 = \frac{1}{2} \theta _2^{-1} \theta _1 \end{aligned}$$
(77)
$$\begin{aligned} \frac{\partial F}{\partial \theta _2}(\theta _1,\theta _2)&= - \frac{1}{4} {}^t\theta _2^{-1} \theta _1 {}^t\theta _1 \theta _2^{-1} - \frac{1}{2} {}^t\theta _2^{-1} = - \left( \frac{1}{2} \theta _2^{-1} \theta _1\right) ^t\left( \frac{1}{2} \theta _2^{-1} \theta _1\right) - \frac{1}{2} \theta _2^{-1} \end{aligned}$$
(78)

To emphasize the coherence of these formulas, recall that the gradient of the log-normalizer corresponds to the expectation of the sufficient statistics:

$$\begin{aligned} \mathbb {E}[x]&= \mu\equiv & {} ~\frac{1}{2}\theta _2^{-1} \theta _1\end{aligned}$$
(79)
$$\begin{aligned} \mathbb {E}[-x{}^tx]&= -\mathbb {E}[x{}^tx] = -\mu {}^{t}\mu - \varSigma\equiv & {} - \left( \frac{1}{2}\theta _2^{-1} \theta _1\right) ^t\left( \frac{1}{2}\theta _2^{-1} \theta _1\right) - \frac{1}{2}\theta _2^{-1} \end{aligned}$$
(80)

The last equation comes from the expansion of \(\mathbb {E}[(x - \mu ) {}^t(x - \mu )]\).
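
The matrix calculus above can also be checked with central finite differences of F. The sketch below (our addition) compares a numerical gradient, taken entry-wise over \(\theta _1\) and \(\theta _2\), with the closed forms of Eqs. (77)–(78).

```python
import numpy as np

def F(theta1, theta2):
    # Log-normalizer of Eq. (76); theta2 is treated as a full d x d matrix here
    d = theta1.shape[0]
    return (0.25 * theta1 @ np.linalg.inv(theta2) @ theta1
            + 0.5 * d * np.log(np.pi) - 0.5 * np.linalg.slogdet(theta2)[1])

def grad_F(theta1, theta2):
    # Closed forms of Eqs. (77)-(78)
    eta1 = 0.5 * np.linalg.solve(theta2, theta1)
    return eta1, -np.outer(eta1, eta1) - 0.5 * np.linalg.inv(theta2)

rng = np.random.default_rng(1)
d = 2
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)
theta1, theta2 = np.linalg.solve(Sigma, mu), 0.5 * np.linalg.inv(Sigma)  # Eqs. (71)-(72)

eps = 1e-6  # central finite differences, entry by entry
g1 = np.array([(F(theta1 + eps * e, theta2) - F(theta1 - eps * e, theta2)) / (2 * eps)
               for e in np.eye(d)])
g2 = np.array([[(F(theta1, theta2 + eps * np.outer(ei, ej))
                 - F(theta1, theta2 - eps * np.outer(ei, ej))) / (2 * eps)
                for ej in np.eye(d)] for ei in np.eye(d)])

e1, e2 = grad_F(theta1, theta2)
print(np.allclose(g1, e1, atol=1e-5), np.allclose(g2, e2, atol=1e-5))  # True True
```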

1.4 Convex Conjugate G of F and Its Gradient

In order to get the dual coordinate system \(H=(\eta _1, \eta _2)\), the following set of equations has to be inverted:

$$\begin{aligned} \eta _1&=\frac{1}{2} \theta _2^{-1} \theta _1 \end{aligned}$$
(81)
$$\begin{aligned} \eta _2&= -\left( \frac{1}{2}\theta _2^{-1} \theta _1\right) ^t\left( \frac{1}{2}\theta _2^{-1} \theta _1\right) - \frac{1}{2}\theta _2^{-1} \end{aligned}$$
(82)

By plugging the first equation into the second one, it follows

$$\begin{aligned} \eta _2 = - \eta _1 {}^t\eta _1 - \frac{1}{2}\theta _2^{-1} \Longleftrightarrow \theta _2= \frac{1}{2}(-\eta _1 {}^t\eta _1 -\eta _2)^{-1} = \frac{\partial G}{\partial \eta _2}(\eta _1,\eta _2) \end{aligned}$$
(83)

and

$$\begin{aligned} \theta _1 = 2 \theta _2\eta _1= (- \eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 = \frac{\partial G}{\partial \eta _1}(\eta _1,\eta _2) \end{aligned}$$
(84)

Formulas are even simpler regarding the source parameters since we know from Eqs. 79 and 80 that

$$\begin{aligned} \eta _1 = \mu\Longleftrightarrow & {} \mu = \eta _1 \end{aligned}$$
(85)
$$\begin{aligned} \eta _2= -\mu {}^{t}\mu - \varSigma\Longleftrightarrow & {} \varSigma = - \eta _1 {}^t\eta _1 - \eta _2 \end{aligned}$$
(86)

In order to compute \(G := F^{*}\), we simply have to reuse our previous results in

$$G(H) = \langle (\nabla F)^{-1} (H), H \rangle - F ( (\nabla F)^{-1} (H))$$

and obtain the following expression

$$\begin{aligned} G(\eta _1, \eta _2)&= \langle (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1, \eta _1 \rangle + \langle \frac{1}{2} (- \eta _1 {}^t\eta _1 - \eta _2)^{-1}, \eta _2 \rangle _{F}\\&- \frac{1}{4} {}^t((-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1) 2(-\eta _1 {}^t\eta _1 - \eta _2) (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 \\&- \frac{d}{2} \log (\pi ) + \frac{1}{2} \log |\frac{1}{2} (-\eta _1 {}^t\eta _1 - \eta _2)^{-1}|\\&= {}^t \eta _1 (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 +\frac{1}{2} tr({}^{t}(-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _2)\\&- \frac{1}{2} {}^t\eta _1{}^t(-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1\\&- \frac{d}{2} \log (\pi ) + \frac{1}{2} \log |(2(-\eta _1 {}^t\eta _1 - \eta _2))^{-1}|\\&= \frac{1}{2} {}^t \eta _1 (- \eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 +\frac{1}{2} tr((-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _2)\\&- \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= \frac{1}{2} \left( tr((- \eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 {}^t \eta _1 ) +tr((-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _2)\right) \\&- \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= - \frac{1}{2} tr((- \eta _1 {}^t\eta _1 - \eta _2)^{-1} (- \eta _1 {}^t \eta _1 - \eta _2)) - \frac{d}{2} \log (\pi )\nonumber \\&- \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= - \frac{1}{2} tr(I_{d}) - \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= - \frac{d}{2} \log (e\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\ \end{aligned}$$

Let us rewrite this expression with source parameters:

$$\begin{aligned} G(\mu , \varSigma ) = - \frac{d}{2} \log (e\pi ) - \frac{1}{2} \log |2\varSigma |\ \end{aligned}$$
(87)
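
Equation (87) can also be checked (our addition) through the Legendre transform itself: evaluating \(\langle \varTheta , H \rangle - F(\varTheta )\) at the dual pair given by Eqs. (71)–(72) and (85)–(86) should return \(G(\mu , \varSigma )\).

```python
import numpy as np

def F(theta1, theta2):
    # Eq. (76)
    d = theta1.shape[0]
    return (0.25 * theta1 @ np.linalg.inv(theta2) @ theta1
            + 0.5 * d * np.log(np.pi) - 0.5 * np.linalg.slogdet(theta2)[1])

def G_closed_form(mu, Sigma):
    # Eq. (87): G(mu, Sigma) = -(d/2) log(e pi) - (1/2) log|2 Sigma|
    d = mu.shape[0]
    return -0.5 * d * np.log(np.e * np.pi) - 0.5 * np.linalg.slogdet(2.0 * Sigma)[1]

rng = np.random.default_rng(2)
d = 3
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)

theta1, theta2 = np.linalg.solve(Sigma, mu), 0.5 * np.linalg.inv(Sigma)  # Eqs. (71)-(72)
eta1, eta2 = mu, -np.outer(mu, mu) - Sigma                               # Eqs. (85)-(86)

# Legendre transform at the dual pair: G(H) = <Theta, H> - F(Theta),
# where <Theta, H> = <theta_1, eta_1> + <theta_2, eta_2>_F
fenchel = theta1 @ eta1 + np.sum(theta2 * eta2) - F(theta1, theta2)
print(np.allclose(fenchel, G_closed_form(mu, Sigma)))  # expected: True
```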

1.5 Kullback–Leibler Divergence

First recall that the Kullback–Leibler divergence between two probability density functions p and q is

$$ KL(p || q) = \int p(x) \log \frac{p(x)}{q(x)} dx$$

For two multivariate normal distributions, it is known in closed form

$$\begin{aligned} KL(\mathcal {N}(\mu _{p},\varSigma _{p}) || \mathcal {N}(\mu _{q},\varSigma _{q})) = \frac{1}{2}\left( \log \left( \frac{|\varSigma _{q}|}{|\varSigma _{p}|}\right) + tr(\varSigma _{q}^{-1}\varSigma _{p}) + {}^{t}(\mu _{q}-\mu _{p})\varSigma _{q}^{-1}(\mu _{q}-\mu _{p}) - d\right) \end{aligned}$$
(88)

Since the multivariate normal distribution is an exponential family, the same result must be obtained using the Bregman divergence for G on the expectation parameters \(H_{p}\) and \(H_{q}\):

$$KL(\mathcal {N}(\mu _{p},\varSigma _{p}) || \mathcal {N}(\mu _{q},\varSigma _{q})) = B_G(H_p || H_q) = G(H_{p}) - G(H_{q}) - \langle H_{p} - H_{q}, \nabla G (H_{q}) \rangle $$
$$\begin{aligned} G(H_{p}) - G(H_{q})&= - \frac{d}{2} \log (e\pi ) - \frac{1}{2} \log |-2(\eta _{1_{p}} {}^t\eta _{1_{p}} + \eta _{2_{p}})| \\&+ \frac{d}{2} \log (e\pi ) + \frac{1}{2} \log |-2(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})| \\&= \frac{1}{2} \log \frac{|-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})|}{|-(\eta _{1_{p}} {}^t\eta _{1_{p}} + \eta _{2_{p}})|}\\ - \langle H_{p} - H_{q}, \nabla G (H_{q}) \rangle&= - \langle \eta _{1_{p}} - \eta _{1_{q}}, - (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}} \rangle \\&- tr\left( ^{t} (\eta _{2_{p}} - \eta _{2_{q}}) \left( -\frac{1}{2} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}\right) \right) \\&= {}^t \eta _{1_{p}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}} - {}^t \eta _{1_{q}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}} \\&- \frac{1}{2} tr({}^{t} \eta _{2_{p}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1})) + \frac{1}{2} tr({}^{t}\eta _{2_{q}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}))\\ \end{aligned}$$

In order to go further, we can express these two formulas using \(\mu \) and \(\varSigma ^{-1} = (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} = -(\eta _1 {}^t\eta _1 + \eta _2)^{-1} \) (cf. Eq. 86):

$$\begin{aligned} \frac{1}{2} \log \frac{|-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})|}{|-(\eta _{1_{p}} {}^t\eta _{1_{p}} + \eta _{2_{p}})|}&= \frac{1}{2} \log \frac{|\varSigma _q|}{|\varSigma _p|} \end{aligned}$$
$$\begin{aligned} {}^t \eta _{1_{p}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}}&= -{}^t \mu _{p} \varSigma _{q}^{-1} \mu _{q}\\ - {}^t \eta _{1_{q}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}}&= {}^t \mu _{q} \varSigma _{q}^{-1} \mu _{q} \end{aligned}$$
$$\begin{aligned} - \frac{1}{2} tr({}^{t} \eta _{2_{p}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}))&= \frac{1}{2} tr((\mu _{p}{}^{t}\mu _{p} + \varSigma _{p}) \varSigma _{q}^{-1})\\&= \frac{1}{2} tr(\mu _{p}{}^{t}\mu _{p}\varSigma _{q}^{-1}) + \frac{1}{2} tr(\varSigma _{p}\varSigma _{q}^{-1})\\&= \frac{1}{2} {}^{t}\mu _{p}\varSigma _{q}^{-1}\mu _{p} + \frac{1}{2} tr(\varSigma _{q}^{-1}\varSigma _{p})\\ + \frac{1}{2} tr({}^{t}\eta _{2_{q}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}))&= - \frac{1}{2} tr((\mu _{q}{}^{t}\mu _{q} + \varSigma _{q}) \varSigma _{q}^{-1})\\&= - \frac{1}{2} tr(\mu _{q}{}^{t}\mu _{q}\varSigma _{q}^{-1}) - \frac{1}{2} tr(\varSigma _{q}\varSigma _{q}^{-1})\\&= - \frac{1}{2} {}^{t}\mu _{q}\varSigma _{q}^{-1}\mu _{q} - \frac{1}{2} d \end{aligned}$$

By summing up these terms, the standard formula for the KL divergence is recovered:

$$\begin{aligned} KL(\mathcal {N}(\mu _{p},\varSigma _{p})&|| \mathcal {N}(\mu _{q},\varSigma _{q})) = \frac{1}{2} \log \frac{|\varSigma _q|}{|\varSigma _p|} -{}^t \mu _{p} \varSigma _{q}^{-1} \mu _{q} +{}^t \mu _{q} \varSigma _{q}^{-1} \mu _{q} + \\&\frac{1}{2} {}^{t}\mu _{p}\varSigma _{q}^{-1}\mu _{p} + \frac{1}{2} tr(\varSigma _{q}^{-1}\varSigma _{p}) - \frac{1}{2} {}^{t}\mu _{q}\varSigma _{q}^{-1}\mu _{q} - \frac{1}{2} d\\ =&\frac{1}{2} \left( \log \frac{|\varSigma _q|}{|\varSigma _p|} + tr(\varSigma _{q}^{-1}\varSigma _{p}) - d~-\right. \\&\left. \left\{ 2{}^t \mu _{p} \varSigma _{q}^{-1} \mu _{q} - 2 {}^t \mu _{q} \varSigma _{q}^{-1} \mu _{q} - ^{t}\mu _{p}\varSigma _{q}^{-1}\mu _{p} + {}^{t}\mu _{q}\varSigma _{q}^{-1}\mu _{q}\right\} \right) \\ =&\frac{1}{2}\left( \log \frac{|\varSigma _q|}{|\varSigma _p|} + tr(\varSigma _{q}^{-1}\varSigma _{p}) + {}^t (\mu _{p} - \mu _{q}) \varSigma _{q}^{-1} (\mu _{p} - \mu _{q}) - d \right) \end{aligned}$$
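
Finally, the identity \(KL = B_G(H_p : H_q)\) and the closed form of Eq. (88) can be compared numerically. This sketch (ours) builds the expectation parameters from Eqs. (85)–(86) and uses \(\nabla G\) from Eqs. (83)–(84).

```python
import numpy as np

def G(eta1, eta2):
    # Cf. Eq. (87) with Sigma = -eta_1 eta_1^T - eta_2
    d = eta1.shape[0]
    Sigma = -np.outer(eta1, eta1) - eta2
    return -0.5 * d * np.log(np.e * np.pi) - 0.5 * np.linalg.slogdet(2.0 * Sigma)[1]

def grad_G(eta1, eta2):
    # Eqs. (83)-(84): grad G(H) = Theta
    S = -np.outer(eta1, eta1) - eta2              # = Sigma
    return np.linalg.solve(S, eta1), 0.5 * np.linalg.inv(S)

def bregman_G(Hp, Hq):
    # B_G(H_p : H_q) = G(H_p) - G(H_q) - <H_p - H_q, grad G(H_q)>
    (p1, p2), (q1, q2) = Hp, Hq
    g1, g2 = grad_G(q1, q2)
    return G(p1, p2) - G(q1, q2) - (p1 - q1) @ g1 - np.sum((p2 - q2) * g2)

def kl_closed_form(mu_p, Sp, mu_q, Sq):
    # Eq. (88)
    d = mu_p.shape[0]
    Sq_inv = np.linalg.inv(Sq)
    diff = mu_q - mu_p
    return 0.5 * (np.linalg.slogdet(Sq)[1] - np.linalg.slogdet(Sp)[1]
                  + np.trace(Sq_inv @ Sp) + diff @ Sq_inv @ diff - d)

rng = np.random.default_rng(3)
d = 3
def random_gaussian():
    A = rng.normal(size=(d, d))
    return rng.normal(size=d), A @ A.T + d * np.eye(d)

(mu_p, Sp), (mu_q, Sq) = random_gaussian(), random_gaussian()
Hp = (mu_p, -np.outer(mu_p, mu_p) - Sp)           # Eqs. (85)-(86)
Hq = (mu_q, -np.outer(mu_q, mu_q) - Sq)
print(np.allclose(bregman_G(Hp, Hq), kl_closed_form(mu_p, Sp, mu_q, Sq)))  # expected: True
```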

Copyright information

© 2017 Springer International Publishing AG

Cite this chapter

Saint-Jean, C., Nielsen, F. (2017). Batch and Online Mixture Learning: A Review with Extensions. In: Nielsen, F., Critchley, F., Dodson, C. (eds) Computational Information Geometry. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-47058-0_11

  • DOI: https://doi.org/10.1007/978-3-319-47058-0_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47056-6

  • Online ISBN: 978-3-319-47058-0