Embedded topics in the stochastic block model

Boutin, Rémi; Bouveyron, Charles; Latouche, Pierre

doi:10.1007/s11222-023-10265-9

Original Paper
Published: 01 July 2023

Statistics and Computing, volume 33, article number 95 (2023)

Rémi Boutin¹, Charles Bouveyron² & Pierre Latouche¹,³

Abstract

Communication networks such as emails or social networks are now ubiquitous and their analysis has become a strategic field. In many applications, the goal is to automatically extract relevant information by looking at the nodes and their connections. Unfortunately, most existing methods focus on analysing the presence or absence of edges, and textual data is often discarded. However, all communication networks actually come with textual data on the edges. To account for this specificity, we consider in this paper networks for which two nodes are linked if and only if they share textual data. We introduce ETSBM, a deep latent variable model that handles embedded topics, to simultaneously cluster the nodes and model the topics used between the different clusters. ETSBM extends both the stochastic block model (SBM) and the embedded topic model (ETM), which are core models for studying networks and corpora, respectively. Inference is done using a variational-Bayes expectation-maximisation algorithm combined with stochastic gradient descent. The methodology is evaluated on synthetic data and on a real-world dataset.

Funding

This work was supported by a doctoral grant awarded by Université Paris Cité and by the French government through the 3IA Côte d’Azur Investment in the Future project, managed by the National Research Agency (ANR) under reference number ANR-19-P3IA-0002.

Author information

Authors and Affiliations

CNRS, Laboratoire MAP5, UMR 8145, Université Paris Cité, Paris, France
Rémi Boutin & Pierre Latouche
CNRS, Laboratoire J.A. Dieudonné, INRIA, MAASAI Team, Université Côte d’Azur, Sophia-Antipolis, France
Charles Bouveyron
CNRS, Laboratoire LMBP, UMR 6620, Université Clermont Auvergne, Aubière, France
Pierre Latouche

Contributions

RB wrote the main manuscript and PL co-wrote Section 5. All authors reviewed the manuscript.

Corresponding author

Correspondence to Rémi Boutin.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Inference

Proof of Proposition 4.1

The marginal log-likelihood can be decomposed into the ELBO and a Kullback–Leibler divergence as follows:

$$\begin{aligned}
\log p(A, W \mid \alpha, \rho)
&= \mathbb{E}_{R}\left[ \log p(A, W \mid \alpha, \rho) \right] \\
&= \mathbb{E}_{R}\left[ \log \frac{p(A, W, Y, \pi, \gamma, \delta \mid \alpha, \rho)}{p(Y, \pi, \gamma, \delta \mid A, W, \alpha, \rho)} \right] \qquad \text{(applying Bayes' rule)} \\
&= \mathbb{E}_{R}\left[ \log \frac{p(A, W, Y, \pi, \gamma, \delta \mid \alpha, \rho)}{R(Y, \pi, \gamma, \delta)} + \log \frac{R(Y, \pi, \gamma, \delta)}{p(Y, \pi, \gamma, \delta \mid A, W, \alpha, \rho)} \right] \\
&= \mathscr{L}(R(\cdot); \alpha, \rho) + \mathrm{KL}\left( R(\cdot) \,\|\, p(Y, \pi, \gamma, \delta \mid A, W, \alpha, \rho) \right).
\end{aligned}$$

$\square $
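
The identity above holds for any variational distribution with full support, which is easy to check numerically on a toy problem. The following is a minimal sketch and not the ETSBM model: it assumes a single discrete latent variable with three states, and the numbers in `joint` and `r` are hypothetical values chosen only for illustration.

```python
import math

def elbo_and_kl(joint, r):
    """Return (log evidence, ELBO, KL) for one discrete latent variable.

    joint[z] = p(x, z) for a fixed observation x (hypothetical toy values);
    r[z] = R(z), any distribution with full support on the same states.
    """
    evidence = sum(joint)                       # p(x) = sum_z p(x, z)
    posterior = [p / evidence for p in joint]   # p(z | x) by Bayes' rule
    # ELBO = E_R[log p(x, z) - log R(z)]
    elbo = sum(rz * math.log(pz / rz) for rz, pz in zip(r, joint))
    # KL(R || p(. | x)) = E_R[log R(z) - log p(z | x)]
    kl = sum(rz * math.log(rz / qz) for rz, qz in zip(r, posterior))
    return math.log(evidence), elbo, kl

# Toy numbers (assumptions, not taken from the paper).
joint = [0.10, 0.25, 0.05]
r = [0.2, 0.5, 0.3]
log_evidence, elbo, kl = elbo_and_kl(joint, r)
```

Because the KL term is non-negative, `elbo` never exceeds `log_evidence`, which is why maximising the ELBO over R tightens the bound.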

Proof of Proposition 4.2

$$\begin{aligned}
\mathscr{L}(R(\cdot); \alpha, \rho)
&= \overbrace{\mathbb{E}_{R}\left[ \log \frac{p(W \mid Y, A, \theta, \alpha, \rho)\, p(\theta)}{R(\theta)} \right]}^{\mathscr{L}^{net}(\tau, \tilde{\pi}_{qr1}, \tilde{\pi}_{qr2}, \tilde{\gamma}; \alpha, \rho) :=}
 + \overbrace{\mathbb{E}_{R}\left[ \log \frac{p(A \mid Y, \pi)\, p(Y \mid \gamma)\, p(\pi)\, p(\gamma)}{R(Y)\, R(\pi)\, R(\gamma)} \right]}^{\mathscr{L}^{texts}(\tau, \nu; \alpha, \rho) :=} \\
&= \mathbb{E}_{R}\left[ \log p(W \mid Y, A, \theta, \alpha, \rho) \right] + \mathbb{E}_{R}\left[ \log p(\theta) \right] - \mathbb{E}_{R}\left[ \log R(\theta) \right] \\
&\quad + \mathbb{E}_{R}\left[ \log p(A \mid Y, \pi) \right] + \mathbb{E}_{R}\left[ \log p(Y \mid \gamma) \right] + \mathbb{E}_{R}\left[ \log p(\pi) \right] + \mathbb{E}_{R}\left[ \log p(\gamma) \right] \\
&\quad - \mathbb{E}_{R}\left[ \log R(Y) \right] - \mathbb{E}_{R}\left[ \log R(\pi) \right] - \mathbb{E}_{R}\left[ \log R(\gamma) \right] \\
&= \sum_{i \ne j}^{M} \sum_{q, r}^{Q} A_{ij} \tau_{iq} \tau_{jr}\, \mathbb{E}_{R}\Big[ \underbrace{\log p(w_{ij} \mid \delta_{qr}, \alpha, \rho)}_{T_{ij}^{\delta_{qr}}} \Big] \\
&\quad - \sum_{q, r} \mathrm{KL}\left( \mathcal{N}(\mu_{qr}(\tau, \nu), \sigma_{qr}(\tau, \nu)) \,\|\, \mathcal{N}(0, I) \right) \\
&\quad + \sum_{i \ne j}^{M} \sum_{q, r}^{Q} \tau_{iq} \tau_{jr} A_{ij} \left( \psi(\kappa_{qr1}) - \psi(\kappa_{qr2}) \right) \\
&\quad + \sum_{i \ne j}^{M} \sum_{q, r}^{Q} \tau_{iq} \tau_{jr} \left( \psi(\kappa_{qr2}) - \psi(\kappa_{qr1} + \kappa_{qr2}) \right) \\
&\quad + \sum_{i=1}^{M} \sum_{q=1}^{Q} \tau_{iq} \left( \psi(\gamma_{q}) - \psi\left( \sum_{q} \gamma_{q} \right) \right) \\
&\quad + \log \mathcal{B}(1_Q) + \log \mathcal{B}(a, b) \\
&\quad - \sum_{i=1}^{M} \sum_{q=1}^{Q} \tau_{iq} \log(\tau_{iq}) \\
&\quad - \sum_{q, r} \log \mathcal{B}(\kappa_{qr1}, \kappa_{qr2}) - \log \mathcal{B}(\gamma).
\end{aligned}$$

(A.1)

where

$$\begin{aligned} T_{ij}^{\delta _{qr}} = \sum _{d=1}^{D_{ij}} \sum _{n=1}^{N_{ij}^d} \sum _{v=1}^V w_{ij}^{dnv} \log \left( \sum _{k=1}^K \theta _{qr k} \beta _{k v} \right) . \end{aligned}$$

(A.2)

and $\theta _{qr} = \operatorname{softmax}(\delta _{qr})$ with $\delta _{qr} = \mu _{qr}(\tau , \nu ) + \sigma _{qr}(\tau , \nu ) \odot \epsilon $, $\epsilon \sim {\mathcal {N}}(0_K, \textrm{I}_{K})$.
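
As a reading aid, the two digamma sums over pairs of nodes in Eq. (A.1) can be recovered from the Bernoulli log-likelihood of a single edge. This sketch assumes that $R(\pi _{qr})$ is a $\textrm{Beta}(\kappa _{qr1}, \kappa _{qr2})$ distribution, which the $\log {\mathcal {B}}(\kappa _{qr1}, \kappa _{qr2})$ term suggests but which is not stated in this excerpt:

$$\begin{aligned}
\mathbb{E}_{R}\left[ \log \pi_{qr} \right] &= \psi(\kappa_{qr1}) - \psi(\kappa_{qr1} + \kappa_{qr2}), \\
\mathbb{E}_{R}\left[ \log (1 - \pi_{qr}) \right] &= \psi(\kappa_{qr2}) - \psi(\kappa_{qr1} + \kappa_{qr2}), \\
\mathbb{E}_{R}\left[ A_{ij} \log \pi_{qr} + (1 - A_{ij}) \log (1 - \pi_{qr}) \right] &= A_{ij} \left( \psi(\kappa_{qr1}) - \psi(\kappa_{qr2}) \right) + \psi(\kappa_{qr2}) - \psi(\kappa_{qr1} + \kappa_{qr2}).
\end{aligned}$$

Weighting the last line by $\tau _{iq} \tau _{jr}$ and summing over $i \ne j$ and $(q, r)$ gives the two sums involving $\kappa _{qr1}$ and $\kappa _{qr2}$ in Eq. (A.1).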

The Kullback–Leibler divergence between two Gaussian distributions has a closed form and is easy to compute. All the terms can be computed exactly except for the expectation of $ T_{ij}^{\delta _{qr}}$, which can be approximated using a Monte Carlo estimator by drawing S samples for each pair (q, r), such that:

$$\begin{aligned}&\epsilon ^s \sim {\mathcal {N}}(0,I_{K}), \ \ \ \delta _{qr}^s = \mu _{qr}( \tau , \nu ) +\sigma _{qr}( \tau , \nu ) \odot \epsilon ^s,\\&\theta _{qr}^s = {{\,\textrm{softmax}\,}}(\delta _{qr}^s). \end{aligned}$$

with $ \odot $ denoting the Hadamard product. Thus, for each pair of nodes (i, j) and pair of clusters (q, r), the estimate is given by:

$$\begin{aligned} \hat{T}_{ij}^{qr} = S^{-1} \sum _{s=1}^S T_{ij}^{\delta ^s_{qr}}. \end{aligned}$$
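
The sampling scheme and the estimator above can be sketched in plain Python for a single pair of nodes and a single pair of clusters. This is an illustration under stated assumptions rather than the authors' implementation: `counts[v]` stands for the word counts obtained by summing $w_{ij}^{dnv}$ over documents and positions, the helper names are hypothetical, and the values of `beta`, `mu` and `sigma` are made up.

```python
import math
import random

def softmax(x):
    """Numerically stable softmax of a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def text_log_lik(theta, beta, counts):
    """T = sum_v counts[v] * log(sum_k theta[k] * beta[k][v]), as in Eq. (A.2)."""
    total = 0.0
    for v, c in enumerate(counts):
        if c:
            word_prob = sum(t * row[v] for t, row in zip(theta, beta))
            total += c * math.log(word_prob)
    return total

def mc_estimate(mu, sigma, beta, counts, n_samples, rng):
    """Average T over n_samples reparameterised draws of delta."""
    acc = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in mu]                   # eps ~ N(0, I_K)
        delta = [m + s * e for m, s, e in zip(mu, sigma, eps)]    # mu + sigma ⊙ eps
        theta = softmax(delta)                                    # topic proportions
        acc += text_log_lik(theta, beta, counts)
    return acc / n_samples

# Hypothetical toy setting: K = 2 topics, V = 3 words.
rng = random.Random(0)
beta = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # each row is a topic distribution
counts = [3, 1, 2]
mu = [0.5, -0.5]
t_hat = mc_estimate(mu, [0.3, 0.3], beta, counts, n_samples=200, rng=rng)
t_det = mc_estimate(mu, [0.0, 0.0], beta, counts, n_samples=5, rng=rng)
```

With `sigma` set to zero every draw equals `mu`, so `t_det` coincides with the deterministic value of T at `softmax(mu)`. Writing the draw as a deterministic function of `eps` is what lets gradients with respect to the mean and standard deviation pass through the sample in an automatic differentiation framework.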

Plugging $\hat{T}_{ij}^{qr}$ into Eq. (A.1) gives the final estimator of the ELBO. $\square $
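
The Gaussian Kullback–Leibler term of Eq. (A.1) needs no sampling. A minimal sketch of its closed form follows, assuming that $\sigma _{qr}$ denotes a vector of standard deviations, that is, a diagonal covariance; the excerpt does not state this explicitly, and the function name is hypothetical.

```python
import math

def kl_diag_gaussian_to_standard(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for per-dimension std devs sigma > 0.

    Closed form: 0.5 * sum_k (sigma_k^2 + mu_k^2 - 1 - log sigma_k^2).
    """
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )

# The divergence vanishes when the two distributions coincide.
kl_zero = kl_diag_gaussian_to_standard([0.0, 0.0], [1.0, 1.0])
# With unit variances only the squared means contribute: 0.5 * (1 + 4).
kl_shift = kl_diag_gaussian_to_standard([1.0, -2.0], [1.0, 1.0])
```

Summing this quantity over all pairs (q, r) gives the term subtracted in Eq. (A.1).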

Appendix B: Real data

Figure 12 provides a translation of topics found by ETSBM on the real dataset and appearing in the meta-network.

Figure 13 provides a translation of topics found by ETM on the real dataset and appearing in the meta-network.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Boutin, R., Bouveyron, C. & Latouche, P. Embedded topics in the stochastic block model. Stat Comput 33, 95 (2023). https://doi.org/10.1007/s11222-023-10265-9

Received: 13 September 2022
Accepted: 13 June 2023
Published: 01 July 2023
DOI: https://doi.org/10.1007/s11222-023-10265-9
