Abstract
We study a natural Wasserstein gradient flow on manifolds of probability distributions with discrete sample spaces. We derive the Riemannian structure for the probability simplex from the dynamical formulation of the Wasserstein distance on a weighted graph. We pull back the geometric structure to the parameter space of any given probability model, which allows us to define a natural gradient flow there. In contrast to the natural Fisher–Rao gradient, the natural Wasserstein gradient incorporates a ground metric on sample space. We illustrate the analysis of elementary exponential family examples and demonstrate an application of the Wasserstein natural gradient to maximum likelihood estimation.
Notes
A length space is one in which the distance between points can be measured as the infimum length of continuous curves between them.
We use the direct method, a standard technique in optimal control: time is discretized, and the sum that replaces the integral is minimized by gradient descent with respect to \((p(t)_i)_{i=1,3, t\in \{t_1,\ldots , t_N\}} \in \mathbb {R}^{2\times N}\). A reference for these techniques is [24].
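A minimal sketch of the direct method described in this note, under stated assumptions: the graph is taken to be the path 1 - 2 - 3 with unit edge weights, the weighted Laplacian L(p) is assumed to use edge weights \((p_i+p_j)/2\), and the functional being discretized is assumed to be \(\int_0^1 \dot p^{\top } L(p)^{\dagger } \dot p \, dt\). The names `weighted_laplacian`, `action`, `to_path` and `minimize_path` are hypothetical, and a finite-difference gradient with backtracking stands in for whichever gradient computation was actually used.

```python
import numpy as np

EDGES = [(0, 1), (1, 2)]  # assumed path graph 1 - 2 - 3 with unit edge weights

def weighted_laplacian(p):
    """Assumed form of L(p): graph Laplacian with edge weights (p_i + p_j) / 2."""
    L = np.zeros((3, 3))
    for i, j in EDGES:
        w = 0.5 * (p[i] + p[j])
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    return L

def action(path):
    """Sum replacing the time integral of (dp/dt)^T L(p)^+ (dp/dt)."""
    dt = 1.0 / (len(path) - 1)
    total = 0.0
    for a, b in zip(path[:-1], path[1:]):
        v = (b - a) / dt                      # finite-difference velocity
        L_mid = weighted_laplacian(0.5 * (a + b))
        total += v @ np.linalg.pinv(L_mid, rcond=1e-10) @ v * dt
    return total

def to_path(x, p_start, p_end):
    """Rebuild the full path from the free coordinates (p_1, p_3) at interior times."""
    inner = np.column_stack([x[:, 0], 1.0 - x[:, 0] - x[:, 1], x[:, 1]])
    return np.vstack([p_start, inner, p_end])

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        e = np.zeros_like(x)
        e[idx] = eps
        g[idx] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

def minimize_path(p_start, p_end, n_steps=8, iters=50):
    """Gradient descent with backtracking, started from the straight line."""
    ts = np.linspace(0.0, 1.0, n_steps + 1)[1:-1, None]
    line = (1.0 - ts) * p_start + ts * p_end
    x = line[:, [0, 2]]                       # free variables: p_1 and p_3
    f = lambda z: action(to_path(z, p_start, p_end))
    for _ in range(iters):
        g = numerical_grad(f, x)
        step = 1e-2
        while step > 1e-10:                   # halve the step until the sum decreases
            cand = x - step * g
            if to_path(cand, p_start, p_end).min() > 0 and f(cand) < f(x):
                x = cand
                break
            step *= 0.5
    return to_path(x, p_start, p_end)
```

Calling `minimize_path(np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.2, 0.7]))` returns an array of shape `(9, 3)` whose rows stay in the open simplex. The backtracking step is a design choice that keeps the discretized sum from increasing between iterations.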
References
Amari, S.: Neural learning in structured parameter spaces-natural Riemannian gradient. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems 9, pp. 127–133. MIT, London (1997)
Amari, S.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)
Amari, S.: Information Geometry and Its Applications. Number 194 in Applied Mathematical Sciences. Springer, Tokyo (2016)
Amari, S., Karakida, R., Oizumi, M.: Information geometry connecting Wasserstein distance and Kullback-Leibler divergence via the Entropy-Relaxed Transportation Problem (2017). arXiv:1709.10219 [cs, math]
Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN (2017). arXiv:1701.07875 [cs, stat]
Ay, N., Jost, J., Lê, H., Schwachhöfer, L.: Information Geometry. Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge / A Series of Modern Surveys in Mathematics. Springer, Berlin (2017)
Bakry, D., Émery, M.: Diffusions hypercontractives. In: Azéma, J., Yor, M. (eds.) Séminaire de Probabilités XIX 1983/84, pp. 177–206. Springer, Berlin (1985)
Benamou, J.-D., Brenier, Y.: A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik 84(3), 375–393 (2000)
Campbell, L.: An extended Čencov characterization of the information metric. Proc. Am. Math. Soc. 98, 135–141 (1986)
Carlen, E.A., Gangbo, W.: Constrained Steepest Descent in the 2-Wasserstein Metric. Ann. Math. 157(3), 807–846 (2003)
Čencov, N.N.: Statistical Decision Rules and Optimal Inference. Translations of Mathematical Monographs, vol. 53. American Mathematical Society, Providence (1982). (Translation from the Russian edited by Lev J. Leifman)
Chow, S.-N., Huang, W., Li, Y., Zhou, H.: Fokker–Planck equations for a free energy functional or Markov process on a graph. Arch. Ration. Mech. Anal. 203(3), 969–1008 (2012)
Chow, S.-N., Li, W., Zhou, H.: A discrete Schrödinger equation via optimal transport on graphs (2017). arXiv:1705.07583 [math]
Chow, S.-N., Li, W., Zhou, H.: Entropy dissipation of Fokker–Planck equations on graphs. Discrete Contin. Dyn. Syst. A 38(10), 4929–4950 (2018)
Chung, F.R.K.: Spectral Graph Theory. Number 92 in Regional Conference Series in Mathematics. Published for the Conference Board of the Mathematical Sciences by the American Mathematical Society, Providence, R.I. (1997)
Frogner, C., Zhang, C., Mobahi, H., Araya-Polo, M., Poggio, T.: Learning with a Wasserstein loss (2015). arXiv:1506.05439 [cs, stat]
Gangbo, W., Li, W., Mou, C.: Geodesic of minimal length in the set of probability measures on graphs. Accepted in ESAIM: COCV (2018)
Karakida, R., Amari, S.: Information geometry of Wasserstein divergence. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information, pp. 119–126. Springer, Cham (2017)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2014). CoRR, arXiv:1412.6980
Lafferty, J.D.: The density manifold and configuration space quantization. Trans. Am. Math. Soc. 305(2), 699–741 (1988)
Lebanon, G.: Axiomatic geometry of conditional models. IEEE Trans. Inf. Theory 51(4), 1283–1294 (2005)
Li, W.: Geometry of probability simplex via optimal transport (2018). arXiv:1803.06360 [math]
Li, W., Montúfar, G.: Ricci curvature for parameter statistics via optimal transport (2018). arXiv:1807.07095
Li, W., Yin, P., Osher, S.: Computations of optimal transport distance with Fisher information regularization. J. Sci. Comput. 75, 1581–1595 (2017)
Lott, J.: Some geometric calculations on Wasserstein space. Commun. Math. Phys. 277(2), 423–437 (2007)
Maas, J.: Gradient flows of the entropy for finite Markov chains. J. Funct. Anal. 261(8), 2250–2292 (2011)
Malagò, L., Matteucci, M., Pistone, G.: Towards the geometry of estimation of distribution algorithms based on the exponential family. In: Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, FOGA ’11, New York, NY, USA. ACM, pp. 230–242 (2011)
Malagò, L., Pistone, G.: Natural gradient flow in the mixture geometry of a discrete exponential family. Entropy 17(12), 4215–4254 (2015)
Mielke, A.: A gradient structure for reaction–diffusion systems and for energy-drift-diffusion systems. Nonlinearity 24(4), 1329–1346 (2011)
Modin, K.: Geometry of matrix decompositions seen through optimal transport and information geometry. J. Geometr. Mech. 9(3), 335–390 (2017)
Montavon, G., Müller, K.-R., Cuturi, M.: Wasserstein training of restricted Boltzmann machines. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 3718–3726. Curran Associates Inc, Red Hook (2016)
Montúfar, G., Rauh, J., Ay, N.: On the Fisher metric of conditional probability polytopes. Entropy 16(6), 3207–3233 (2014)
Nelson, E.: Quantum Fluctuations. Princeton series in physics. Princeton University Press, Princeton (1985)
Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Commun. Partial Diff. Equ. 26(1–2), 101–174 (2001)
Pascanu, R., Bengio, Y.: Revisiting natural gradient for deep networks. In: International Conference on Learning Representations 2014 (Conference Track) (2014)
Peters, J., Vijayakumar, S., Schaal, S.: Natural actor-critic. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) Machine Learning: ECML 2005, pp. 280–291. Springer, Berlin (2005)
Takatsu, A.: Wasserstein geometry of Gaussian measures. Osaka J. Math. 48(4), 1005–1026 (2011)
Villani, C.: Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin (2009)
Wong, T.-K.: Logarithmic divergences from optimal transport and Rényi geometry (2017). arXiv:1712.03610 [cs, math, stat]
Yi, S., Wierstra, D., Schaul, T., Schmidhuber, J.: Stochastic search using the natural gradient. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA. ACM, pp. 1161–1168 (2009)
Acknowledgements
The authors would like to thank Prof. Luigi Malagò for his inspiring talk at UCLA in December 2017. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement no 757983).
Appendices
Appendix
In this appendix we formally review the equivalence of the static and dynamical formulations of the \(L^2\)-Wasserstein metric. For more details see [38].
Consider the duality of linear programming.
By standard considerations, the supremum in the last formula is attained when
This means that \(\Phi ^1\), \(\Phi ^0\) are related to the viscosity solution of the Hamilton-Jacobi equation on \(\Omega \):
with \(\Phi ^0(x)=\Phi (0,x)\), \(\Phi ^1(x)=\Phi (1,x)\). Hence (21) becomes
By duality in the above formulas, we obtain the variational problem (1). In other words, taking the density path \(\rho _t=\rho (t,x)\) as the dual variable of \(\Phi _t=\Phi (t,x)\), we have
The third equality is derived by integration by parts w.r.t. t and the fourth equality is by switching infimum and supremum relations and integration by parts w.r.t. x.
In the above derivations, the relation between the Hopf–Lax formula (22) and the Hamilton–Jacobi equation (23) plays a key role in the equivalence of the static and dynamic formulations of the Wasserstein metric. This is also a consequence of the fact that the sample space \(\Omega \) is a length space, i.e.,
However, in a discrete sample space I, there is no continuous path \(\gamma (t)\in I\) connecting two distinct points. Thus the relation between (22) and (23) does not hold on I. This indicates that in discrete sample spaces, the Wasserstein metric in Definition 1 can differ from the one defined by linear programming (5). See the related discussions in [12, 26].
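For orientation, the standard forms of the objects discussed in this appendix can be sketched as follows. The normalization (cost \(\frac{1}{2}d(x,y)^2\) and the resulting factors of \(\frac{1}{2}\)), the notation \(d(x,y)\) for the distance on \(\Omega \), and the symbols \(\rho ^0\), \(\rho ^1\) for the endpoint densities are assumptions chosen so that the five displays agree with one another; they are not claimed to match the numbered equations (1) and (21)–(23) term by term.

```latex
% Kantorovich duality for the static (linear programming) formulation
\frac{1}{2} W(\rho^0,\rho^1)^2
  = \sup_{\Phi^0,\Phi^1} \Big\{ \int_\Omega \Phi^1(x)\rho^1(x)\,dx
      - \int_\Omega \Phi^0(x)\rho^0(x)\,dx \;:\;
      \Phi^1(y) - \Phi^0(x) \le \tfrac{1}{2} d(x,y)^2 \Big\}

% Hopf--Lax formula: the optimal \Phi^1 for a given \Phi^0
\Phi^1(y) = \inf_{x\in\Omega} \Big\{ \Phi^0(x) + \tfrac{1}{2} d(x,y)^2 \Big\}

% Hamilton--Jacobi equation whose viscosity solution interpolates \Phi^0 and \Phi^1
\partial_t \Phi(t,x) + \tfrac{1}{2} \big|\nabla \Phi(t,x)\big|^2 = 0

% Dynamical formulation obtained after dualizing with the density path \rho_t
\frac{1}{2} W(\rho^0,\rho^1)^2
  = \inf_{\rho_t,\Phi_t} \Big\{ \int_0^1\!\!\int_\Omega
      \tfrac{1}{2} \big|\nabla \Phi_t\big|^2 \rho_t \,dx\,dt \;:\;
      \partial_t \rho_t + \nabla\cdot(\rho_t \nabla \Phi_t) = 0,\;
      \rho_0 = \rho^0,\; \rho_1 = \rho^1 \Big\}

% Length space property of \Omega
d(x,y) = \inf_{\gamma} \Big\{ \int_0^1 |\dot\gamma(t)|\,dt \;:\;
      \gamma(0) = x,\; \gamma(1) = y,\; \gamma \text{ continuous in } \Omega \Big\}
```

The last display is the step that has no counterpart on a discrete set I, which is why the two formulations can separate there.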
Notations
We use the following notations.
Continuous/discrete sample space | \(\Omega \) | I |
---|---|---|
Inner product | \(g^\Omega \) | \(g^I\) |
Gradient | \(\nabla \) | \(\nabla _G\) |
Divergence | \(\text {div}\) | \(\text {div}_G\) |
Hessian in \(\Omega \) | Hess | |
Potential function set | \(\mathcal {F}(\Omega )\) | \(\mathcal {F}(I)\) |
Weighted Laplacian operator | \(-\nabla \cdot (\rho \nabla )\) | L(p) |

Continuous/discrete probability space | \(\mathcal {P}_+(\Omega )\) | \(\mathcal {P}_+(I)\) |
---|---|---|
Probability distribution | \(\rho \) | p |
Tangent space | \(T_\rho \mathcal {P}_+(\Omega )\) | \(T_p\mathcal {P}_+(I)\) |
Wasserstein metric tensor | \(g^W\) | \(g^W\) |
Dual coordinates | \(\Phi (x)\) | \((\Phi _i)_{i=1}^n\) |
Primal coordinates | \(\sigma (x)\) | \((\sigma _i)_{i=1}^n\) |
First differential operator | \(\delta _\rho \) | \(\nabla _p\) |
Second differential operator | \(\delta ^2_{\rho \rho }\) | |
Gradient operator | \(\nabla _W\) | |
Hessian operator | \(\text {Hess}_W\) | |
Levi–Civita connection | \(\nabla ^W_{\cdot }\cdot \) | |

Parameter space/Probability model | \(\Theta \) | \(p(\Theta )\) |
---|---|---|
Inner product | \(g_\theta \) | \(g_{p(\theta )}\) |
Tangent space | \(T_\theta \Theta \) | \(T_{p(\theta )}p(\Theta )\) |
\(L^2\)-Wasserstein matrix | \(G(\theta )\) | |
\(L^2\)-Wasserstein distance | \(\text {Dist}\) | \(\text {Dist}\) |
Second fundamental form | \(B(\cdot , \cdot )\) | |
Projection operator | H | |
Levi–Civita connection | \((\nabla ^W_\cdot \cdot )^{||}\) | |
Jacobi operator | \(J_\theta \) | |
First differential operator | \(\nabla _\theta \) | |
Gradient operator | \(\nabla _g\) | |
Hessian operator | \(\text {Hess}_g\) | |
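To make the parameter-space notation concrete, the following is a minimal sketch of one way the matrix \(G(\theta )\) and a natural gradient descent step could be assembled for a three-state model. Everything specific here is an assumption for illustration: the path graph 1 - 2 - 3 with unit weights, the edge weights \((p_i+p_j)/2\) in L(p), the softmax parameterization with the last logit fixed to zero, the form \(G(\theta ) = J_\theta ^{\top } L(p(\theta ))^{\dagger } J_\theta \) for the pulled-back metric, and the function names. The objective is the Kullback–Leibler divergence from a target q to \(p(\theta )\), whose minimization corresponds to maximum likelihood estimation when q is an empirical distribution.

```python
import numpy as np

EDGES = [(0, 1), (1, 2)]  # assumed path graph 1 - 2 - 3 with unit edge weights

def weighted_laplacian(p):
    """Assumed form of L(p): graph Laplacian with edge weights (p_i + p_j) / 2."""
    L = np.zeros((3, 3))
    for i, j in EDGES:
        w = 0.5 * (p[i] + p[j])
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    return L

def model(theta):
    """Hypothetical model: softmax on three states, last logit fixed to 0."""
    z = np.exp(np.append(theta, 0.0))
    return z / z.sum()

def jacobian(theta):
    """J_theta: derivative of p(theta) with respect to theta, shape (3, 2)."""
    p = model(theta)
    return (np.diag(p) - np.outer(p, p))[:, :2]

def metric_tensor(theta):
    """Assumed pulled-back metric G(theta) = J^T L(p)^+ J."""
    J = jacobian(theta)
    L_pinv = np.linalg.pinv(weighted_laplacian(model(theta)), rcond=1e-10)
    return J.T @ L_pinv @ J

def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) for strictly positive q and p."""
    return float(np.sum(q * np.log(q / p)))

def natural_gradient_descent(q, theta0, step=0.1, iters=200):
    """Explicit Euler steps theta <- theta - step * G(theta)^{-1} grad F(theta)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        grad = jacobian(theta).T @ (-q / model(theta))  # Euclidean gradient of KL
        theta = theta - step * np.linalg.solve(metric_tensor(theta), grad)
    return theta
```

For example, `natural_gradient_descent(np.array([0.2, 0.5, 0.3]), np.zeros(2))` starts from the uniform distribution and moves \(p(\theta )\) toward the target. Replacing `metric_tensor` with the Fisher information matrix would give the Fisher–Rao natural gradient instead, which does not depend on the graph.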
Cite this article
Li, W., Montúfar, G. Natural gradient via optimal transport. Info. Geo. 1, 181–214 (2018). https://doi.org/10.1007/s41884-018-0015-3
DOI: https://doi.org/10.1007/s41884-018-0015-3