Abstract
Monte Carlo algorithms often aim to draw from a distribution \(\pi \) by simulating a Markov chain with transition kernel \(P\) such that \(\pi \) is invariant under \(P\). However, there are many situations in which it is impractical or impossible to draw from the transition kernel \(P\). This is the case, for instance, with massive datasets, where it is prohibitively expensive to calculate the likelihood, and with intractable likelihood models arising from, for example, Gibbs random fields, such as those found in spatial statistics and network analysis. A natural approach in these cases is to replace \(P\) by an approximation \(\hat{P}\). Using theory from the stability of Markov chains, we explore a variety of situations where it is possible to quantify how ‘close’ the chain given by the transition kernel \(\hat{P}\) is to the chain given by \(P\). We apply these results to several examples from spatial statistics and network analysis.
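As a toy illustration of the question raised above, the following sketch compares a small finite-state kernel \(P\) with a perturbed kernel \(\hat{P}\) and reports how far apart the two invariant distributions are in total variation. The three-state matrix, the mixing perturbation in `perturb` and all function names are assumptions made for illustration only; they are not taken from the paper, whose results concern general state spaces.

```python
def step(dist, kernel):
    """One step of the chain: row vector times transition matrix."""
    n = len(kernel)
    return [sum(dist[i] * kernel[i][j] for i in range(n)) for j in range(n)]


def stationary(kernel, iters=2000):
    """Approximate the invariant distribution by power iteration."""
    n = len(kernel)
    dist = [1.0 / n] * n
    for _ in range(iters):
        dist = step(dist, kernel)
    return dist


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def perturb(kernel, eps, target):
    """Hypothetical approximation: mix every row of P with a fixed row."""
    return [[(1.0 - eps) * pij + eps * tj for pij, tj in zip(row, target)]
            for row in kernel]


# Illustrative exact kernel P (each row sums to one) and a uniform noise row.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]
noise = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
pi = stationary(P)


def gaps(eps):
    """Return (worst-case row distance between P and P_hat,
    distance between their invariant distributions)."""
    P_hat = perturb(P, eps, noise)
    kernel_gap = max(total_variation(r, rh) for r, rh in zip(P, P_hat))
    return kernel_gap, total_variation(pi, stationary(P_hat))


if __name__ == "__main__":
    for eps in (0.2, 0.1, 0.05, 0.01):
        print(eps, gaps(eps))
```

In this example both distances shrink as `eps` decreases, which is the qualitative behaviour the stability results aim to quantify.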
References
Ahn, S., Korattikara, A., Welling, M.: Bayesian posterior sampling via stochastic gradient Fisher scoring. In: Proceedings of the 29th International Conference on Machine Learning (2012)
Andrieu, C., Roberts, G.: The pseudo-marginal approach for efficient Monte-Carlo computations. Ann. Stat. 37(2), 697–725 (2009)
Andrieu, C., Vihola, M.: Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. Preprint arXiv:1210.1484 (2012)
Bardenet, R., Doucet, A., Holmes, C.: Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In: Proceedings of the 31st International Conference on Machine Learning (2014)
Beaumont, M.A.: Estimation of population growth or decline in genetically monitored populations. Genetics 164, 1139–1160 (2003)
Besag, J.E.: Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B 36, 192–236 (1974)
Bottou, L., Bousquet, O.: The tradeoffs of large-scale learning. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning, pp. 351–368. MIT Press, Cambridge (2011)
Bühlmann, P., Van de Geer, S.: Statistics for High-Dimensional Data. Springer, Berlin (2011)
Caimo, A., Friel, N.: Bayesian inference for exponential random graph models. Soc. Netw. 33, 41–55 (2011)
Dalalyan, A., Tsybakov, A.B.: Sparse regression learning by aggregation and Langevin. J. Comput. Syst. Sci. 78(5), 1423–1443 (2012)
Ferré, D., Hervé, L., Ledoux, J.: Regular perturbation of \(V\)-geometrically ergodic Markov chains. J. Appl. Probab. 50(1), 184–194 (2013)
Friel, N., Pettitt, A.N.: Likelihood estimation and inference for the autologistic model. J. Comput. Graph. Stat. 13, 232–246 (2004)
Friel, N., Pettitt, A.N., Reeves, R., Wit, E.: Bayesian inference in hidden Markov random fields for binary data defined on large lattices. J. Comput. Graph. Stat. 18, 243–261 (2009)
Friel, N., Rue, H.: Recursive computing and simulation-free inference for general factorizable models. Biometrika 94, 661–672 (2007)
Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
Gilks, W., Roberts, G., George, E.: Adaptive direction sampling. Statistician 43, 179–189 (1994)
Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods (with discussion). J. R. Stat. Soc. Ser. B 73, 123–214 (2011)
Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
Kartashov, N.V.: Strong Stable Markov Chains. VSP, Utrecht (1996)
Korattikara, A., Chen, Y., Welling, M.: Austerity in MCMC land: cutting the Metropolis–Hastings budget. In: Proceedings of the 31st International Conference on Machine Learning, pp. 681–688 (2014)
Liang, F., Jin, I.-H.: An auxiliary variables Metropolis–Hastings algorithm for sampling from distributions with intractable normalizing constants. Technical report (2011)
Marin, J.-M., Pudlo, P., Robert, C.P., Ryder, R.J.: Approximate Bayesian computational methods. Stat. Comput. 22(6), 1167–1180 (2012)
Meyn, S., Tweedie, R.L.: Markov Chains and Stochastic Stability. Cambridge University Press, Cambridge (1993)
Mitrophanov, A.Y.: Sensitivity and convergence of uniformly ergodic Markov chains. J. Appl. Probab. 42, 1003–1014 (2005)
Møller, J., Pettitt, A.N., Reeves, R., Berthelsen, K.K.: An efficient Markov chain Monte-Carlo method for distributions with intractable normalizing constants. Biometrika 93, 451–458 (2006)
Murray, I., Ghahramani, Z., MacKay, D.: MCMC for doubly-intractable distributions. In: Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence UAI06, AUAI Press, Arlington, Virginia (2006)
Nicholls, G.K., Fox, C., Watt, A.M.: Coupled MCMC with a randomized acceptance probability. Preprint arXiv:1205.6857 (2012)
Propp, J., Wilson, D.: Exactly sampling with coupled Markov chains and applications to statistical mechanics. Random Struct. Algorithms 9, 223–252 (1996)
Reeves, R., Pettitt, A.N.: Efficient recursions for general factorisable models. Biometrika 91, 751–757 (2004)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Roberts, G.O., Stramer, O.: Langevin diffusions and Metropolis–Hastings algorithms. Methodol. Comput. Appl. Probab. 4, 337–357 (2002)
Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996a)
Roberts, G.O., Tweedie, R.L.: Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 83(1), 95–110 (1996b)
Robins, G., Pattison, P., Kalish, Y., Lusher, D.: An introduction to exponential random graph models for social networks. Soc. Netw. 29(2), 169–348 (2007)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
Valiant, L.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984)
Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning, pp. 681–688 (2011)
Acknowledgments
The Insight Centre for Data Analytics is supported by Science Foundation Ireland under Grant Number SFI/12/RC/2289. Nial Friel’s research was also supported by a Science Foundation Ireland grant: 12/IP/1424.
Appendix: Proofs
Proof of Corollary 2.3
We apply Theorem 2.1. First, note that we have
and
So we can write
and, finally,
\(\square \)
Proof of Lemma 1
We again apply Theorem 2.1. Note that
Now, note that
where \(X\sim \mathcal {N}(0,I)\) and . Then:
So finally,
\(\square \)
Proof of Lemma 2
We only have to check that
\(\square \)
Proof of Theorem 3.1
Under the assumptions of Theorem 3.1, note that (4) leads to
Let us consider any measurable subset \(B\) of \(\varTheta \) and \(\theta \in \varTheta \). We have
This proves that \(\varTheta \) is a small set for the Lebesgue measure (multiplied by the constant \(1/c_{\pi }^2 c_{h}^3 \mathcal {K}^4\)) on \(\varTheta \). According to Theorem 16.0.2, page 394, in Meyn and Tweedie (1993), this proves that:
where
(note that, by definition, \(\mathcal {K},c_\pi ,c_h>1\) so we necessarily have \(0<\rho <1\)). So, Condition (H1) in Lemma 2.3 is satisfied.
Moreover,
So, Condition (H2) in Lemma 2.3 is satisfied. We can apply this lemma to obtain
with
with \(\lambda =\left\lceil \frac{\log (1/C)}{\log (\rho )} \right\rceil \). \(\square \)
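The formula for \(\lambda \) above is straightforward to evaluate numerically. The sketch below assumes that \(C>0\) and \(0<\rho <1\) are the constants of a geometric bound of the form \(C\rho ^n\); the displayed bound did not survive in this copy, so that reading is an assumption. Under it, \(\lambda \) is the smallest integer \(n\) with \(C\rho ^n\le 1\). The function name and the numerical values are illustrative only.

```python
import math


def lambda_from_constants(C, rho):
    """Return ceil(log(1/C) / log(rho)) for C > 0 and 0 < rho < 1.

    Since log(rho) < 0, the condition C * rho**n <= 1 is equivalent to
    n >= log(1/C) / log(rho), so the result is the smallest such integer n.
    """
    if C <= 0.0:
        raise ValueError("C must be positive")
    if not (0.0 < rho < 1.0):
        raise ValueError("rho must lie in (0, 1)")
    return math.ceil(math.log(1.0 / C) / math.log(rho))


if __name__ == "__main__":
    # Hypothetical constants, chosen only to show the order of magnitude.
    for C, rho in ((2.0, 0.5), (10.0, 0.9), (10.0, 0.99)):
        print(C, rho, lambda_from_constants(C, rho))
```

Because \(\lambda \) scales like \(1/\log (1/\rho )\), a value of \(\rho \) close to one makes \(\lambda \) large.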
Proof of Lemma 3
Note that
So we have to find an upper bound, uniformly over \(\theta \), for
Let us put \(V:=\frac{1}{N}\sum _{i=1}^N V^{(i)} := \frac{1}{N}\sum _{i=1}^{N} \Sigma ^{\frac{1}{2}} \{ s(y'_i) - \mathbb {E}_{y'\sim f_{\theta }}[s(y')]\}\) and denote by \(V_j\) (\(j=1,\dots ,k\)) the coordinates of \(V\), and by \(V_j^{(i)}\) (\(j=1,\dots ,k\)) the coordinates of \(V^{(i)}\). We have
Now, remark that \(V_j=\frac{1}{N}\sum _{i=1}^{N} V_j^{(i)}\) with \(-\mathcal {S} \Vert \Sigma \Vert \le V_j^{(i)} \le \mathcal {S} \Vert \Sigma \Vert \), so Hoeffding’s inequality ensures that, for any \(t\ge 0\),
As a consequence, for any \(\tau >0\),
where \(\mathcal {N}\sim \mathcal {N}(0,1)\). Now, we assume that \(N>4 k \mathcal {S}^2 \Vert \Sigma \Vert ^2\). Then \(\frac{2k}{N}<\frac{1}{2\mathcal {S}^2\Vert \Sigma \Vert ^2}\), and hence \(\frac{1}{\mathcal {S}^2\Vert \Sigma \Vert ^2}-\frac{2k}{N} >\frac{1}{2\mathcal {S}^2 \Vert \Sigma \Vert ^2}\). This simplifies the bound to
Finally, we put \(\tau =\sqrt{\log (N/k)/(2\mathcal {S}^2 \Vert \Sigma \Vert ^2)}\) to get
It follows that
This ends the proof. \(\square \)
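The concentration step in the proof of Lemma 3 relies on Hoeffding’s inequality for averages of bounded, centred variables. The following sketch compares the standard two-sided form of that inequality, \(2\exp (-Nt^2/(2c^2))\) for variables taking values in \([-c,c]\), with a simulated tail frequency. Here \(c\) plays the role of \(\mathcal {S}\Vert \Sigma \Vert \); the uniform distribution, the sample sizes and the function names are assumptions made for illustration, and the exact display used in the proof is not reproduced.

```python
import math
import random


def hoeffding_bound(n, c, t):
    """Two-sided Hoeffding bound on P(|mean| >= t) for the mean of n
    independent, centred variables taking values in [-c, c]."""
    return 2.0 * math.exp(-n * t * t / (2.0 * c * c))


def empirical_tail(n, c, t, trials=2000, seed=0):
    """Monte Carlo estimate of P(|mean| >= t) for uniform draws on [-c, c]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.uniform(-c, c) for _ in range(n)) / n
        if abs(mean) >= t:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n, c = 50, 1.0
    for t in (0.1, 0.2, 0.3):
        print(t, empirical_tail(n, c, t), hoeffding_bound(n, c, t))
```

The bound depends on the variables only through their range, which is why the proof needs only the uniform bound on the coordinates \(V_j^{(i)}\).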
Proof of Lemma 3.2
We just check all the conditions of Theorem 2.2. First, from Lemma 3, we know that \(\Vert P_{\Sigma }-\hat{P}_{\Sigma }\Vert \le \sqrt{\delta /2}\rightarrow 0\) when \(N\rightarrow \infty \). Then, we have to find the function \(V\). Note that here:
Then, according to Theorem 3.1, page 352, in Roberts and Tweedie (1996a) (and its proof), we know that for \(\Sigma <s^2\), for some positive numbers \(a\) and \(b\), for \(V(\theta )=a\theta \) when \(\theta \ge 0\) and \(V(\theta )=-b\theta \) for \(\theta <0\), there exist \(0<\delta <1\), \(\beta >0\) and an interval \(I\) with
and so \(P_{\Sigma }\) is geometrically ergodic with function \(V\). We calculate
and:
So,
So all the assumptions of Theorem 2.2 are satisfied, and we can conclude that \( \Vert \pi _{\Sigma }-\pi _{\Sigma ,N}\Vert \xrightarrow [N\rightarrow \infty ]{} 0\). \(\square \)
Cite this article
Alquier, P., Friel, N., Everitt, R. et al. Noisy Monte Carlo: convergence of Markov chains with approximate transition kernels. Stat Comput 26, 29–47 (2016). https://doi.org/10.1007/s11222-014-9521-x
DOI: https://doi.org/10.1007/s11222-014-9521-x