Abstract
In many real-world networks, a vertex is associated with a transaction database that describes the behaviour of the vertex in detail. A typical example is a social network, where the behaviour of each user is captured by a transaction database that stores the content she posts daily. Specifically, a transaction database is a collection of transactions, where each transaction corresponds to a tweet. Each transaction consists of a set of items, where each item may correspond to a keyword or a video clip contained in the tweet. To model this type of scenario, we propose the novel notion of a transaction database graph, in which each vertex is associated with a transaction database. Every path of the graph is a sequence of vertices that induces multiple sequences of transactions. The sequences of transactions induced by all paths in the graph form an extremely large sequence database. Mining frequent sequential patterns from this sequence database discovers interesting subsequences that appear frequently in many paths of the network. Our goal is to find the top-k frequent sequential patterns in the sequence database induced by a transaction database graph. This is challenging because the induced sequence database is too large to be explicitly materialized and stored, and finding the top-k frequent sequential patterns is #P-hard. To tackle this problem, we propose an efficient two-step sampling algorithm that approximates the top-k frequent sequential patterns with a provable quality guarantee. Extensive experiments on synthetic and real-world data sets demonstrate the effectiveness and efficiency of our method.
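To make the setting concrete, the following is a minimal brute-force sketch on a toy graph. It is not the two-step sampling algorithm of the paper. It rests on several simplifying assumptions that the abstract does not state: paths are simple and have a fixed number of vertices, a path induces one sequence for every way of choosing one transaction per vertex, a pattern is an ordered sequence of single items matched in distinct transactions, and frequency is the fraction of induced sequences containing the pattern. All names and data (`EDGES`, `DATABASES`, `top_k`, and so on) are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

# Toy transaction database graph (hypothetical data).
# EDGES maps each vertex to its out-neighbours.
EDGES = {"a": ["b", "c"], "b": ["c"], "c": []}
# DATABASES maps each vertex to its transactions (sets of items).
DATABASES = {
    "a": [{"x", "y"}, {"x"}],
    "b": [{"y"}, {"y", "z"}],
    "c": [{"z"}],
}


def paths(edges, length):
    """Yield every simple directed path with exactly `length` vertices."""
    def extend(path):
        if len(path) == length:
            yield tuple(path)
            return
        for nxt in edges[path[-1]]:
            if nxt not in path:
                yield from extend(path + [nxt])

    for start in edges:
        yield from extend([start])


def induced_sequences(edges, databases, length):
    """Yield one transaction sequence per choice of a transaction at each vertex."""
    for path in paths(edges, length):
        yield from product(*(databases[v] for v in path))


def patterns_in(sequence):
    """Return all single-item sequential patterns contained in `sequence`."""
    found = set()
    for r in range(1, len(sequence) + 1):
        for positions in combinations(range(len(sequence)), r):
            for items in product(*(sorted(sequence[i]) for i in positions)):
                found.add(items)
    return found


def top_k(edges, databases, length, k):
    """Exhaustively count pattern frequencies and return the k most frequent."""
    counts = Counter()
    total = 0
    for seq in induced_sequences(edges, databases, length):
        total += 1
        counts.update(patterns_in(seq))  # each pattern counted once per sequence
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(pattern, count / total) for pattern, count in ranked[:k]]


if __name__ == "__main__":
    for pattern, freq in top_k(EDGES, DATABASES, length=2, k=3):
        print(pattern, freq)
```

Even on this three-vertex graph, the three two-vertex paths induce eight transaction sequences. The number of induced sequences grows with both the number of paths and the product of the database sizes along each path, which is why exhaustive enumeration of this kind does not scale and a sampling approach becomes attractive.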
Acknowledgments
This work was supported in part by the National Key Research and Development Program of China (No. 2017YFB0803301), the Natural Science Foundation of China (No. U1836215), the DongGuan Innovative Research Team Program (No. 201636000100038), and the 111 Project (No. B18008).
Cite this article
Lei, M., Chu, L., Wang, Z. et al. Mining top-k sequential patterns in transaction database graphs. World Wide Web 23, 103–130 (2020). https://doi.org/10.1007/s11280-019-00686-w
DOI: https://doi.org/10.1007/s11280-019-00686-w