Abstract
Communication networks such as emails or social networks are now ubiquitous and their analysis has become a strategic field. In many applications, the goal is to automatically extract relevant information by looking at the nodes and their connections. Unfortunately, most existing methods focus on the presence or absence of edges, and textual data is often discarded, even though all communication networks actually come with textual data on the edges. To account for this specificity, we consider in this paper networks in which two nodes are linked if and only if they share textual data. We introduce ETSBM, a deep latent variable model that handles embedded topics and simultaneously clusters the nodes while modelling the topics used between the different clusters. ETSBM extends both the stochastic block model (SBM) and the embedded topic model (ETM), which are core models for studying networks and corpora, respectively. Inference is performed using a variational-Bayes expectation-maximisation algorithm combined with stochastic gradient descent. The methodology is evaluated on synthetic data and on a real-world dataset.
Funding
This work was supported by a doctoral grant awarded by Université Paris Cité and by the French government, through the 3IA Côte d’Azur, Investment in the Future, project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002.
Author information
Contributions
RB wrote the main manuscript, PL co-wrote Section 5, and all authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Appendices
Appendix A: Inference
Proof of Proposition 4.1
The ELBO can be decomposed as follows:
\(\square \)
Proof of Proposition 4.2
where,
and \(\theta _{qr} = \mu _{qr}(\tau , \nu ) + \sigma _{qr}(\tau , \nu ) \epsilon \), \(\epsilon \sim {\mathcal {N}}(0_K, \textrm{I}_{K})\).
The Kullback–Leibler divergence between two Gaussian distributions has a closed form and is easy to compute. All the terms can be computed except for the expectation of \( T_{ij}^{\delta _{qr}}\), which can be approximated using a Monte Carlo estimator by drawing \(S\) samples for each pair \((q, r)\), such that:
with \( \odot \) denoting the Hadamard product. Thus, for each pair of nodes (i, j) and pair of clusters (q, r), the estimate is given by:
Plugging \(\hat{T}_{ij}^{qr}\) into Eq. (A.1) gives the final estimator of the ELBO. \(\square \)
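The two computational ingredients of this proof, a closed-form Gaussian Kullback–Leibler term and a Monte Carlo expectation over reparameterised samples, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a diagonal covariance and a standard normal reference distribution for the KL term, and it uses a hypothetical integrand (the log of one softmax coordinate) in place of \(T_{ij}^{\delta _{qr}}\), whose exact expression is not reproduced here. All function names are illustrative.

```python
import math
import random


def kl_diag_gaussian_to_standard(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).

    Assumption: diagonal covariance and a standard normal reference.
    """
    return 0.5 * sum(
        s * s + m * m - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma)
    )


def sample_reparameterised(mu, sigma, rng):
    """Draw theta = mu + sigma * eps (elementwise), with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def mc_expectation(f, mu, sigma, n_samples, rng):
    """Monte Carlo estimate of E[f(theta)] from n_samples reparameterised draws."""
    total = 0.0
    for _ in range(n_samples):
        total += f(sample_reparameterised(mu, sigma, rng))
    return total / n_samples


def log_softmax_entry(k):
    """Hypothetical integrand: log of the k-th softmax coordinate of theta."""
    def f(theta):
        top = max(theta)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(t - top) for t in theta))
        return theta[k] - log_norm
    return f


rng = random.Random(0)
mu = [0.5, -1.0, 0.2]      # illustrative variational means (K = 3)
sigma = [0.3, 0.8, 0.5]    # illustrative variational standard deviations
kl = kl_diag_gaussian_to_standard(mu, sigma)
estimate = mc_expectation(log_softmax_entry(0), mu, sigma, 2000, rng)
print(f"KL term: {kl:.4f}, Monte Carlo estimate: {estimate:.4f}")
```

Because the noise \(\epsilon\) is drawn independently of the mean and standard deviation, the sampled value is a deterministic, differentiable function of those parameters, which is what allows stochastic gradient descent to be applied to an estimator of this kind.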
Appendix B: Real data
Figure 12 provides a translation of topics found by ETSBM on the real dataset and appearing in the meta-network.
Figure 13 provides a translation of topics found by ETM on the real dataset and appearing in the meta-network.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Boutin, R., Bouveyron, C. & Latouche, P. Embedded topics in the stochastic block model. Stat Comput 33, 95 (2023). https://doi.org/10.1007/s11222-023-10265-9
DOI: https://doi.org/10.1007/s11222-023-10265-9