Natural gradient via optimal transport

Li, Wuchen; Montúfar, Guido

doi:10.1007/s41884-018-0015-3

Research Paper
Published: 19 November 2018

Information Geometry, Volume 1, pages 181–214 (2018)

Abstract

We study a natural Wasserstein gradient flow on manifolds of probability distributions with discrete sample spaces. We derive the Riemannian structure for the probability simplex from the dynamical formulation of the Wasserstein distance on a weighted graph. We pull back the geometric structure to the parameter space of any given probability model, which allows us to define a natural gradient flow there. In contrast to the natural Fisher–Rao gradient, the natural Wasserstein gradient incorporates a ground metric on sample space. We illustrate the analysis of elementary exponential family examples and demonstrate an application of the Wasserstein natural gradient to maximum likelihood estimation.

Notes

A length space is one in which the distance between points can be measured as the infimum length of continuous curves between them.
We use the direct method, a standard technique in optimal control. Time is discretized, and the sum that replaces the integral is minimized by gradient descent with respect to $(p(t)_i)_{i=1,3,\, t\in \{t_1,\ldots , t_N\}} \in \mathbb {R}^{2\times N}$. A reference for these techniques is [24].
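
The direct method described in this note can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a path graph 1-2-3 with unit edge weights, an edge mobility $(p_i+p_j)/2$, central-difference gradients, and a backtracking step size, none of which are specified in the note. All function names are hypothetical.

```python
# Direct-method sketch: discretize time, replace the action integral by a sum,
# and minimize over the interior path points (p_1(t_k), p_3(t_k)) by gradient descent.
# ASSUMPTIONS (not from the source): path graph 1-2-3 with unit edge weights and
# edge mobility theta_ij(p) = (p_i + p_j) / 2; all names here are hypothetical.


def energy(path, dt):
    """Discretized action: sum over segments of 0.5 * g_p(dp, dp) / dt."""
    total = 0.0
    for (a1, a3), (b1, b3) in zip(path[:-1], path[1:]):
        m1, m3 = 0.5 * (a1 + b1), 0.5 * (a3 + b3)  # segment midpoint
        m2 = 1.0 - m1 - m3                          # p_2 follows from normalization
        theta12, theta23 = 0.5 * (m1 + m2), 0.5 * (m2 + m3)
        # On a path graph the edge fluxes are determined by dp_1 and dp_3.
        total += 0.5 * ((b1 - a1) ** 2 / theta12 + (b3 - a3) ** 2 / theta23) / dt
    return total


def feasible(path):
    """All three probabilities stay strictly positive along the path."""
    return all(p1 > 0 and p3 > 0 and p1 + p3 < 1 for p1, p3 in path)


def numerical_gradient(path, dt, h=1e-6):
    """Central-difference gradient of the energy; endpoints are held fixed."""
    grad = [[0.0, 0.0] for _ in path]
    for k in range(1, len(path) - 1):
        for j in range(2):
            original = path[k][j]
            path[k][j] = original + h
            e_plus = energy(path, dt)
            path[k][j] = original - h
            e_minus = energy(path, dt)
            path[k][j] = original
            grad[k][j] = (e_plus - e_minus) / (2 * h)
    return grad


def direct_method(start, end, n_steps=8, iters=200, lr=0.05):
    """Return (path, energy of the straight line, energy after descent)."""
    dt = 1.0 / n_steps
    # Initial guess: straight line between the endpoints.
    path = [[s + (e - s) * k / n_steps for s, e in zip(start, end)]
            for k in range(n_steps + 1)]
    e_line = current = energy(path, dt)
    for _ in range(iters):
        grad = numerical_gradient(path, dt)
        step = lr
        while step > 1e-12:  # backtracking: accept only feasible, decreasing steps
            trial = [[x - step * g for x, g in zip(pt, gr)]
                     for pt, gr in zip(path, grad)]
            if feasible(trial):
                e_trial = energy(trial, dt)
                if e_trial < current:
                    path, current = trial, e_trial
                    break
            step *= 0.5
    return path, e_line, current
```

Calling `direct_method((0.6, 0.1), (0.1, 0.6))` returns the optimized path together with the discretized energies before and after descent; the final energy approximates half the squared distance for this assumed metric.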

References

1. Amari, S.: Neural learning in structured parameter spaces-natural Riemannian gradient. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems 9, pp. 127–133. MIT, London (1997)
2. Amari, S.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)
3. Amari, S.: Information Geometry and Its Applications. Applied Mathematical Sciences, vol. 194. Springer, Tokyo (2016)
4. Amari, S., Karakida, R., Oizumi, M.: Information geometry connecting Wasserstein distance and Kullback-Leibler divergence via the Entropy-Relaxed Transportation Problem (2017). arXiv:1709.10219 [cs, math]
5. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN (2017). arXiv:1701.07875 [cs, stat]
6. Ay, N., Jost, J., Lê, H., Schwachhöfer, L.: Information Geometry. Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge / A Series of Modern Surveys in Mathematics. Springer, Berlin (2017)
7. Bakry, D., Émery, M.: Diffusions hypercontractives. In: Azéma, J., Yor, M. (eds.) Séminaire de Probabilités XIX 1983/84, pp. 177–206. Springer, Berlin (1985)
8. Benamou, J.-D., Brenier, Y.: A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik 84(3), 375–393 (2000)
9. Campbell, L.: An extended Čencov characterization of the information metric. Proc. Am. Math. Soc. 98, 135–141 (1986)
10. Carlen, E.A., Gangbo, W.: Constrained steepest descent in the 2-Wasserstein metric. Ann. Math. 157(3), 807–846 (2003)
11. Čencov, N.N.: Statistical Decision Rules and Optimal Inference. Translations of Mathematical Monographs, vol. 53. American Mathematical Society, Providence (1982). (Translation from the Russian edited by Lev J. Leifman)
12. Chow, S.-N., Huang, W., Li, Y., Zhou, H.: Fokker–Planck equations for a free energy functional or Markov process on a graph. Arch. Ration. Mech. Anal. 203(3), 969–1008 (2012)
13. Chow, S.-N., Li, W., Zhou, H.: A discrete Schrödinger equation via optimal transport on graphs (2017). arXiv:1705.07583 [math]
14. Chow, S.-N., Li, W., Zhou, H.: Entropy dissipation of Fokker–Planck equations on graphs. Discrete Contin. Dyn. Syst. A 38(10), 4929–4950 (2018)
15. Chung, F.R.K.: Spectral Graph Theory. Regional Conference Series in Mathematics, no. 92. Published for the Conference Board of the Mathematical Sciences by the American Mathematical Society, Providence, R.I. (1997)
16. Frogner, C., Zhang, C., Mobahi, H., Araya-Polo, M., Poggio, T.: Learning with a Wasserstein loss (2015). arXiv:1506.05439 [cs, stat]
17. Gangbo, W., Li, W., Mou, C.: Geodesic of minimal length in the set of probability measures on graphs. Accepted in ESAIM: COCV (2018)
18. Karakida, R., Amari, S.: Information geometry of Wasserstein divergence. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information, pp. 119–126. Springer, Cham (2017)
19. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2014). CoRR, arXiv:1412.6980
20. Lafferty, J.D.: The density manifold and configuration space quantization. Trans. Am. Math. Soc. 305(2), 699–741 (1988)
21. Lebanon, G.: Axiomatic geometry of conditional models. IEEE Trans. Inf. Theory 51(4), 1283–1294 (2005)
22. Li, W.: Geometry of probability simplex via optimal transport (2018). arXiv:1803.06360 [math]
23. Li, W., Montúfar, G.: Ricci curvature for parameter statistics via optimal transport (2018). arXiv:1807.07095
24. Li, W., Yin, P., Osher, S.: Computations of optimal transport distance with Fisher information regularization. J. Sci. Comput. 75, 1581–1595 (2017)
25. Lott, J.: Some geometric calculations on Wasserstein space. Commun. Math. Phys. 277(2), 423–437 (2007)
26. Maas, J.: Gradient flows of the entropy for finite Markov chains. J. Funct. Anal. 261(8), 2250–2292 (2011)
27. Malagò, L., Matteucci, M., Pistone, G.: Towards the geometry of estimation of distribution algorithms based on the exponential family. In: Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, FOGA ’11, pp. 230–242. ACM, New York (2011)
28. Malagò, L., Pistone, G.: Natural gradient flow in the mixture geometry of a discrete exponential family. Entropy 17(12), 4215–4254 (2015)
29. Mielke, A.: A gradient structure for reaction–diffusion systems and for energy-drift-diffusion systems. Nonlinearity 24(4), 1329–1346 (2011)
30. Modin, K.: Geometry of matrix decompositions seen through optimal transport and information geometry. J. Geometr. Mech. 9(3), 335–390 (2017)
31. Montavon, G., Müller, K.-R., Cuturi, M.: Wasserstein training of restricted Boltzmann machines. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 3718–3726. Curran Associates Inc, Red Hook (2016)
32. Montúfar, G., Rauh, J., Ay, N.: On the Fisher metric of conditional probability polytopes. Entropy 16(6), 3207–3233 (2014)
33. Nelson, E.: Quantum Fluctuations. Princeton Series in Physics. Princeton University Press, Princeton (1985)
34. Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Commun. Partial Diff. Equ. 26(1–2), 101–174 (2001)
35. Pascanu, R., Bengio, Y.: Revisiting natural gradient for deep networks. In: International Conference on Learning Representations 2014 (Conference Track) (2014)
36. Peters, J., Vijayakumar, S., Schaal, S.: Natural actor-critic. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) Machine Learning: ECML 2005, pp. 280–291. Springer, Berlin (2005)
37. Takatsu, A.: Wasserstein geometry of Gaussian measures. Osaka J. Math. 48(4), 1005–1026 (2011)
38. Villani, C.: Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften, vol. 338. Springer, Berlin (2009)
39. Wong, T.-K.: Logarithmic divergences from optimal transport and Rényi geometry (2017). arXiv:1712.03610 [cs, math, stat]
40. Yi, S., Wierstra, D., Schaul, T., Schmidhuber, J.: Stochastic search using the natural gradient. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp. 1161–1168. ACM, New York (2009)

Acknowledgements

The authors would like to thank Prof. Luigi Malagò for his inspiring talk at UCLA in December 2017. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 757983).

Author information

Authors and Affiliations

Department of Mathematics, University of California, Los Angeles, USA
Wuchen Li
Department of Mathematics and Department of Statistics, University of California, Los Angeles, USA
Guido Montúfar
Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
Guido Montúfar

Corresponding author

Correspondence to Wuchen Li.

Appendices

Appendix

In this appendix we formally review the equivalence of the static and dynamical formulations of the $L^2$-Wasserstein metric. For more details see [38].

Consider the duality of linear programming.

$$\begin{aligned} \frac{1}{2}W(\rho ^0,\rho ^1)^2&=\inf _{\pi \ge 0}\Big \{\int _{\Omega }\int _{\Omega }\frac{1}{2}d_{\Omega }(x,y)^2\pi (x,y)\,dx\,dy:\int _\Omega \pi (x,y)\,dy=\rho ^0(x),~\int _\Omega \pi (x,y)\,dx=\rho ^1(y)\Big \}\\&=\sup _{\Phi ^1, \Phi ^0}\Big \{\int _{\Omega }\Phi ^1(y)\rho ^1(y)\,dy-\int _\Omega \Phi ^0(x)\rho ^0(x)\,dx:\Phi ^1(y)-\Phi ^0(x)\le \frac{1}{2}d_\Omega (x,y)^2\Big \}. \end{aligned}$$

(21)
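
The infimum over couplings $\pi$ can be made concrete in a finite setting. The sketch below is an illustration only and is not taken from the article: it assumes finitely supported measures on sorted points of the real line with $d(x,y)=|x-y|$, in which case the monotone (north-west corner) coupling is known to be optimal for the quadratic cost. The function names are hypothetical.

```python
# Static (Kantorovich) problem for finitely supported measures on the real line.
# ASSUMPTION (not from the source): points are sorted on R with d(x, y) = |x - y|,
# so the monotone (north-west corner) coupling is optimal for the quadratic cost.


def monotone_coupling(p, q):
    """Return a list of (i, j, mass) that moves p onto q in sorted order."""
    p, q = list(p), list(q)  # work on copies
    plan, i, j = [], 0, 0
    while i < len(p) and j < len(q):
        m = min(p[i], q[j])
        if m > 0:
            plan.append((i, j, m))
        p[i] -= m
        q[j] -= m
        # At least one of the two bins is now exactly empty.
        if p[i] == 0:
            i += 1
        if q[j] == 0:
            j += 1
    return plan


def half_w2_squared(xs, p, q):
    """Transport cost sum_ij 0.5 * d(x_i, x_j)^2 * pi_ij of the monotone plan."""
    return sum(0.5 * (xs[i] - xs[j]) ** 2 * m
               for i, j, m in monotone_coupling(p, q))
```

For example, `half_w2_squared([0.0, 1.0, 2.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])` moves a unit mass over distance 2 and so evaluates $\frac{1}{2}\cdot 2^2$.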

By standard considerations, the supremum in (21) is attained when

$$\begin{aligned} \Phi ^1(y)=\inf _{x\in \Omega }~\Big \{\Phi ^0(x)+\frac{1}{2}d_\Omega (x,y)^2\Big \}. \end{aligned}$$

(22)
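
On a finite grid, the Hopf–Lax infimal convolution $\Phi^1(y)=\min_x\{\Phi^0(x)+\frac{1}{2}(x-y)^2\}$ and the dual constraint of (21) can be checked directly. The following is a small sketch under the assumption of grid points on the real line with $d(x,y)=|x-y|$; the grid, the sample potential and the function names are hypothetical.

```python
# Hopf-Lax infimal convolution on a finite grid of the real line.
# ASSUMPTION (not from the source): d(x, y) = |x - y| and a hypothetical grid.


def hopf_lax(xs, phi0):
    """Phi^1(y) = min_x [ Phi^0(x) + 0.5 * (x - y)^2 ] for every grid point y."""
    return [min(f + 0.5 * (x - y) ** 2 for x, f in zip(xs, phi0)) for y in xs]


def dual_feasible(xs, phi0, phi1, tol=1e-12):
    """Check Phi^1(y) - Phi^0(x) <= 0.5 * d(x, y)^2 for all pairs (x, y)."""
    return all(g - f <= 0.5 * (x - y) ** 2 + tol
               for y, g in zip(xs, phi1)
               for x, f in zip(xs, phi0))


xs = [0.0, 1.0, 2.0]
phi0 = [0.0, 5.0, 5.0]
phi1 = hopf_lax(xs, phi0)  # [0.0, 0.5, 2.0]
```

By construction `phi1` is the largest function satisfying the constraint for the given `phi0`, which is why it maximizes the dual objective for fixed $\Phi^0$.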

Formula (22) means that $\Phi ^1$, $\Phi ^0$ are related to the viscosity solution of the Hamilton–Jacobi equation on $\Omega $:

$$\begin{aligned} \frac{\partial \Phi (t,x)}{\partial t}+\frac{1}{2}g_x^\Omega (\nabla \Phi (t,x), \nabla \Phi (t,x))=0, \end{aligned}$$

(23)

with $\Phi ^0(x)=\Phi (0,x)$, $\Phi ^1(x)=\Phi (1,x)$. Hence (21) becomes

$$\begin{aligned}&\frac{1}{2}W(\rho ^0,\rho ^1)^2\\&\quad =\sup _{\Phi }\Big \{\int _{\Omega }\Phi ^1(x)\rho ^1(x)-\Phi ^0(x)\rho ^0(x)dx:\frac{\partial \Phi (t,x)}{\partial t}+\frac{1}{2}g_x^\Omega (\nabla \Phi (t,x), \nabla \Phi (t,x))=0 \Big \}. \end{aligned}$$
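
A worked instance may make the link between the Hopf–Lax formula and the Hamilton–Jacobi equation (23) concrete. The setting is an assumption chosen for illustration and is not taken from the article: $\Omega =\mathbb {R}$ with the Euclidean metric and a linear initial potential $\Phi ^0(x)=ax$, $a\in \mathbb {R}$.

$$\begin{aligned} \Phi (t,y)&=\inf _{x\in \mathbb {R}}\Big \{ax+\frac{(x-y)^2}{2t}\Big \}=ay-\frac{a^2t}{2},\qquad \text {attained at } x=y-at,\\ \frac{\partial \Phi }{\partial t}+\frac{1}{2}\Big (\frac{\partial \Phi }{\partial y}\Big )^2&=-\frac{a^2}{2}+\frac{a^2}{2}=0. \end{aligned}$$

At $t=1$ this gives $\Phi ^1(y)=ay-\frac{a^2}{2}$, so the pair $(\Phi ^0,\Phi ^1)$ is connected by a solution of the Hamilton–Jacobi equation in this case.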

By duality, the formulas above yield the variational problem (1). More precisely, introduce the density path $\rho _t=\rho (t,x)$ as the dual variable for the constraint on $\Phi _t=\Phi (t,x)$; then

$$\begin{aligned} \frac{1}{2}W(\rho ^0,\rho ^1)^2&=\sup _{\Phi _t}\inf _{\rho _t}~\int _{\Omega }(\Phi ^1\rho ^1-\Phi ^0\rho ^0)\,dx-\int _0^1\int _{\Omega }\rho _t\Big [ \partial _t\Phi _t+\frac{1}{2}g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\Big ]\,dx\, dt\\&=\sup _{\Phi _t}\inf _{\rho _t}~\int _{\Omega }(\Phi ^1\rho ^1-\Phi ^0\rho ^0)\,dx-\int _0^1\int _{\Omega }\rho _t\, \partial _t\Phi _t\,dx\,dt- \int _0^1\int _{\Omega }\frac{1}{2}g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\rho _t\,dx\, dt\\&=\sup _{\Phi _t}\inf _{\rho _t}~\int _0^1\int _{\Omega }\Big [\partial _t\rho _t\, \Phi _t-g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\rho _t\Big ]\,dx\, dt+\int _0^1\int _{\Omega }\frac{1}{2}g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\rho _t\,dx\, dt \\&=\inf _{\rho _t}\sup _{\Phi _t}~\int _0^1\int _{\Omega }\Phi _t\big (\partial _t\rho _t+\text {div}(\rho _t\nabla \Phi _t)\big )\,dx\, dt+\int _0^1\int _{\Omega }\frac{1}{2}g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\rho _t\,dx\, dt \\&=\inf _{\rho _t}~\Big \{\int _0^1\int _{\Omega }\frac{1}{2}g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\rho _t\,dx\, dt:\partial _t\rho _t+\text {div}(\rho _t\nabla \Phi _t)=0,~\rho _0=\rho ^0, ~\rho _1=\rho ^1\Big \}. \end{aligned}$$

The third equality follows from integration by parts with respect to $t$, using the endpoint conditions $\rho _0=\rho ^0$ and $\rho _1=\rho ^1$. The fourth equality follows from exchanging the infimum and supremum and integrating by parts with respect to $x$.
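
As a sanity check on the final line of this derivation, consider a rigid translation. This example is an illustration under the assumption $\Omega =\mathbb {R}$ with the Euclidean metric and is not taken from the article. For a fixed $m\in \mathbb {R}$, take $\rho _t(x)=\rho ^0(x-tm)$ and $\Phi _t(x)=mx-\frac{m^2t}{2}$, so that $\nabla \Phi _t=m$. Then

$$\begin{aligned} \partial _t\rho _t+\text {div}(\rho _t\nabla \Phi _t)&=-m\,\partial _x\rho ^0(x-tm)+m\,\partial _x\rho ^0(x-tm)=0,\\ \int _0^1\int _{\mathbb {R}}\frac{1}{2}|\nabla \Phi _t|^2\rho _t\,dx\, dt&=\frac{m^2}{2}. \end{aligned}$$

This agrees with the static value $\frac{1}{2}W(\rho ^0,\rho ^1)^2=\frac{m^2}{2}$, since for a translation the optimal map is $x\mapsto x+m$ and every unit of mass travels the distance $|m|$.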

In the above derivations, the relation between the Hopf–Lax formula (22) and the Hamilton–Jacobi equation (23) plays a key role in establishing the equivalence of the static and dynamical formulations of the Wasserstein metric. This relation is in turn a consequence of the fact that the sample space $\Omega $ is a length space, i.e.,

$$\begin{aligned} d_\Omega (x,y)^2=\inf _{\gamma (t)}\Big \{\int _0^1g_{\gamma (t)}^\Omega (\dot{\gamma }, \dot{\gamma })dt:\gamma (0)=x,~\gamma (1)=y\Big \}. \end{aligned}$$

However, in a discrete sample space $I$, there is no path $\gamma (t)\in I$ connecting two distinct points. Thus the relation between (22) and (23) does not hold on $I$. This indicates that in discrete sample spaces, the Wasserstein metric in Definition 1 can differ from the one defined by linear programming (5). See the related discussions in [12, 26].
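
The discrepancy can be seen on the smallest possible example. The sketch below is illustrative only and rests on assumptions that are not stated in this appendix: a two-point graph with a single edge of weight $\omega =1/d^2$, and a weighted Laplacian $L(p)$ built with the edge mobility $(p_1+p_2)/2$, which equals $1/2$ for every $p$ on two points. The function names are hypothetical.

```python
# Two-point graph: squared distance from linear programming versus the squared
# geodesic distance of the dynamical (graph) metric.
# ASSUMPTIONS (not from this appendix): one edge with weight omega = 1 / d^2 and
# edge mobility theta(p) = (p_1 + p_2) / 2 = 1/2; names are hypothetical.


def lp_w2_squared(p0, p1, d=1.0):
    """Linear programming: a mass |p0_1 - p1_1| must cross the edge at cost d^2."""
    return d ** 2 * abs(p0[0] - p1[0])


def graph_w2_squared(p0, p1, d=1.0):
    """Dynamical metric: for sigma = (s, -s), sigma^T L(p)^+ sigma = s^2 / (omega * theta).

    The metric does not depend on p here, so geodesics are straight lines and the
    squared length is (p0_1 - p1_1)^2 / (omega * theta).
    """
    omega, theta = 1.0 / d ** 2, 0.5
    return (p0[0] - p1[0]) ** 2 / (omega * theta)


p0, p1 = (0.6, 0.4), (0.4, 0.6)
lp_value = lp_w2_squared(p0, p1)        # approximately 0.2
graph_value = graph_w2_squared(p0, p1)  # approximately 0.08
```

Under these assumptions the linear-programming value scales linearly in the transported mass while the dynamical value scales quadratically, so the two coincide only for special mass differences.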

Notations

We use the following notations.

Continuous/discrete sample space	$\Omega $	$I$
Inner product	$g^\Omega $	$g^I$
Gradient	$\nabla $	$\nabla _G$
Divergence	$\text {div}$	$\text {div}_G$
Hessian in $\Omega $	$\text {Hess}$
Potential function set	$\mathcal {F}(\Omega )$	$\mathcal {F}(I)$
Weighted Laplacian operator	$-\nabla \cdot (\rho \nabla )$	$L(p)$

Continuous/discrete probability space	$\mathcal {P}_+(\Omega )$	$\mathcal {P}_+(I)$
Probability distribution	$\rho $	$p$
Tangent space	$T_\rho \mathcal {P}_+(\Omega )$	$T_p\mathcal {P}_+(I)$
Wasserstein metric tensor	$g^W$	$g^W$
Dual coordinates	$\Phi (x)$	$(\Phi _i)_{i=1}^n$
Primal coordinates	$\sigma (x)$	$(\sigma _i)_{i=1}^n$
First differential operator	$\delta _\rho $	$\nabla _p$
Second differential operator	$\delta ^2_{\rho \rho }$
Gradient operator		$\nabla _W$
Hessian operator		$\text {Hess}_W$
Levi–Civita connection		$\nabla ^W_{\cdot }\cdot $

Parameter space/Probability model	$\Theta $	$p(\Theta )$
Inner product	$g_\theta $	$g_{p(\theta )}$
Tangent space	$T_\theta \Theta $	$T_{p(\theta )}p(\Theta )$
$L^2$-Wasserstein matrix	$G(\theta )$
$L^2$-Wasserstein distance	$\text {Dist}$	$\text {Dist}$
Second fundamental form		$B(\cdot , \cdot )$
Projection operator		$H$
Levi–Civita connection		$(\nabla ^W_\cdot \cdot )^{\|}$
Jacobi operator	$J_\theta $
First differential operator	$\nabla _\theta $
Gradient operator	$\nabla _g$
Hessian operator	$\text {Hess}_g$

About this article

Cite this article

Li, W., Montúfar, G. Natural gradient via optimal transport. Info. Geo. 1, 181–214 (2018). https://doi.org/10.1007/s41884-018-0015-3

Received: 15 March 2018
Revised: 27 August 2018
Published: 19 November 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s41884-018-0015-3
