Abstract
The Internet has proliferated over the past few decades, and a large amount of textual information is now generated on the Web. It is impossible for individuals to locate and digest all the latest updates available there. Text summarization provides an efficient way to generate short, concise abstracts from massive document collections. However, these collections involve many events, which are hard for a summarization procedure to identify directly. We propose a novel methodology that identifies events in such text corpora and creates a summary for each event. We employ a probabilistic topic model to learn the latent topics of the documents and then discover events from the documents' topic distributions. To produce the summaries, we define the word set coverage problem (WSCP), which captures the most representative sentences for summarizing an event, and we propose an approximate algorithm to solve this optimization problem. We conduct a set of experiments to evaluate our approach on two real datasets: Sina news and Johnson & Johnson medical news. On both datasets, our method outperforms competitive baselines as measured by the harmonic mean of coverage and conciseness.
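To make the coverage idea concrete, the sketch below applies the classic greedy set-cover heuristic to sentence selection. It is only an illustration under stated assumptions, not the approximate algorithm proposed in the article, whose details the abstract does not give. The function name `greedy_word_cover`, the whitespace tokenization, and the idea of passing an explicit set of target words for one event are all hypothetical choices made for this example.

```python
from typing import List, Set


def greedy_word_cover(sentences: List[str], target_words: Set[str]) -> List[int]:
    """Greedily pick sentence indices until the target words are covered.

    Hypothetical helper: this is the textbook greedy set-cover heuristic,
    not the algorithm from the article.
    """
    if not sentences:
        return []
    # Naive whitespace tokenization (an assumption); keep only target words.
    sentence_words = [set(s.lower().split()) & target_words for s in sentences]
    uncovered = set(target_words)
    chosen: List[int] = []
    while uncovered:
        # Pick the sentence that covers the most still-uncovered words;
        # ties go to the earliest sentence.
        best = max(range(len(sentences)),
                   key=lambda i: len(sentence_words[i] & uncovered))
        gain = sentence_words[best] & uncovered
        if not gain:
            break  # the remaining words appear in no sentence
        chosen.append(best)
        uncovered -= gain
    return chosen
```

With made-up input such as the sentences `"the drug recall was announced"`, `"recall announced"`, and `"lawsuit filed over drug"` and the target words `{"drug", "recall", "lawsuit", "filed"}`, the function first selects the third sentence (three new words) and then the first (the remaining word). Fewer selected sentences means a more concise summary for the same coverage, which is the trade-off the abstract's evaluation measure is concerned with.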
Acknowledgments
This work is partially supported by the National Basic Research Program (973) of China (No. 2012CB316203) and NSFC under Grant Nos. 61402177, 61170838 and 61272036. The authors would also like to thank the Key Disciplines of Software Engineering of Shanghai Second Polytechnic University under Grant No. XXKZD1301 and the Project of Shanghai Shen-kang Hospital Development Centre (No. 2014SKMR-04).
Cite this article
Yan, J., Cheng, W., Wang, C. et al. Optimizing word set coverage for multi-event summarization. J Comb Optim 30, 996–1015 (2015). https://doi.org/10.1007/s10878-015-9855-0
DOI: https://doi.org/10.1007/s10878-015-9855-0