Abstract
We define Multilingual Multi-Document Summarization (MMDS) as the process of identifying the main information of a cluster with (at least) two texts, one in the user’s language and one in a foreign language, and presenting it as a summary in the user’s language. Although the task is relevant given the increasing amount of online information in different languages, the only existing systems for (Brazilian) Portuguese are baselines, which apply machine translation (MT) to obtain a monolingual input and use superficial features for sentence extraction. We report our investigation of the application of a conceptual frequency measure to build a summary in Portuguese from a bilingual cluster (Portuguese and English). The methods tackle two additional challenges: using Princeton WordNet for noun annotation and applying MT to translate the selected English sentences into Portuguese. The experiments were performed on a corpus of 20 clusters and show that lexical-conceptual knowledge improves the linguistic quality and informativeness of extracts.
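As an illustration only, the conceptual frequency idea mentioned in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `NOUN_TO_CONCEPT` dictionary is a hypothetical stand-in for Princeton WordNet annotation, tokenization is naive, and a sentence score is assumed to be the sum of the cluster-wide frequencies of its distinct concepts.

```python
from collections import Counter

# Hypothetical noun-to-concept mapping. In the paper the concepts come from
# Princeton WordNet synsets; this toy dictionary only stands in for them.
NOUN_TO_CONCEPT = {
    "policeman": "police",
    "officer": "police",
    "robber": "criminal",
    "thief": "criminal",
    "bank": "bank",
}


def concepts_in(sentence):
    """Return the concepts of the known nouns in a sentence (naive tokenization)."""
    tokens = sentence.lower().replace(".", " ").replace(",", " ").split()
    return [NOUN_TO_CONCEPT[t] for t in tokens if t in NOUN_TO_CONCEPT]


def rank_by_concept_frequency(sentences):
    """Rank sentences by the summed cluster-wide frequency of their concepts.

    Assumption: each distinct concept in a sentence contributes its frequency
    once; ties keep the original sentence order.
    """
    freq = Counter(c for s in sentences for c in concepts_in(s))
    scored = [
        (sum(freq[c] for c in set(concepts_in(s))), i, s)
        for i, s in enumerate(sentences)
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(s, score) for score, _, s in scored]
```

Calling `rank_by_concept_frequency` on the sentences of a cluster returns `(sentence, score)` pairs, highest score first; an extract would then be built from the top-ranked sentences up to a length limit.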
Notes
- 1.
Summarization technique that involves ranking sentences using some scoring mechanism, picking the top scoring sentences, and concatenating them in a certain order to build the summary [5].
- 2.
- 3.
A semantic network of English in which the meanings of word forms and expressions of noun, verb, adjective, and adverb classes are organized into “sets of synonyms” (synsets). Each synset expresses a distinct concept/sense and the synsets are interlinked through conceptual-semantic (i.e., hyponymy, meronymy, entailment, and cause) and lexical (i.e., antonymy) relations [11].
- 4.
The nouns are grouped into concept sets using WN.Pr synsets and the hyponymy relation. To build a set, highly polysemous nouns are not disambiguated but replaced by other nouns that are strongly related to the same verb (e.g., “officer” is replaced by “policeman” because of its relation with “arrest”). Once the sets are built, sentences are ranked by concept frequency [6].
- 5.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) counts the n-grams shared by the automatic summaries and the reference (human) summaries; its author has shown that it ranks automatic summaries as well as humans do [14].
- 6.
In this method, the first sentence of each text in the cluster is selected until a maximum number of words/bytes is reached; if the first sentence of every text has already been included, the second sentence of each text is added to the summary, and so on [3].
- 7.
- 8.
- 9.
This “ontology” consists of a generalization/specialization hierarchy of concepts (i.e., a taxonomy).
- 10.
Summaries that contain some degree of paraphrase of the input.
- 11.
- 12.
We thank Fernando A. A. Nóbrega for helping to adapt the tool.
- 13.
- 14.
We have excluded other resources (e.g., Google Translation) because of usage/license limitations.
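The position-based baseline described in note 6 can be sketched as below. This is a hedged sketch, not the implementation from [3]: the function name `positional_baseline` is hypothetical, the budget is assumed to be counted in whitespace-separated words, and selection is assumed to stop at the first sentence that would exceed the budget rather than truncating it.

```python
def positional_baseline(texts, max_words):
    """Select sentences round-robin by position until the word budget is hit.

    `texts` is a list of documents, each given as a list of sentences.
    All first sentences are taken (in document order), then all second
    sentences, and so on. Assumption: selection stops as soon as the next
    sentence would exceed `max_words`.
    """
    summary, used = [], 0
    longest = max((len(sentences) for sentences in texts), default=0)
    for position in range(longest):
        for sentences in texts:
            if position >= len(sentences):
                continue  # this document has no sentence at this position
            n_words = len(sentences[position].split())
            if used + n_words > max_words:
                return summary
            summary.append(sentences[position])
            used += n_words
    return summary
```

For a bilingual cluster, the sentences selected from the foreign-language text would still need to be translated into the user's language, which this sketch leaves out.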
References
Evans, D.K., Klavans, J.L., McKeown, K.R.: Columbia NewsBlaster: multilingual news summarization on the web. In: North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1–4, Boston (2004)
Roark, B., Fisher, S.: OGI OHSU baseline multilingual multi-document summarization system. In: Multilingual Summarization Evaluation (MSE), Michigan, USA (2005)
Evans, D.K., Klavans, J.L., McKeown, K.R.: Similarity-based multilingual multi-document summarization. Technical report CUCS-014-05. Columbia University, New York (2005)
Tosta, F.E.S., Di-Felippo, A., Pardo, T.A.S.: Estudo de métodos clássicos de sumarização automática no cenário multidocumento multilíngue. In: 4th Workshop de IC em Tecnologia da Informação e da Linguagem Humana, pp. 34–36, Fortaleza, Brazil (2013)
Mani, I.: Automatic Summarization. John Benjamins Publishing Co., Amsterdam (2004)
Schiffman, B., Nenkova, A., McKeown, K.: Experiments in multi-document summarization. In: 2nd International Conference on HLT Research, pp. 52–58, San Francisco (2002)
Ye, S., Chua, T.-S., Kan, M.-Y., Qiu, L.: Document concept lattice for text understanding and summarization. Inf. Process. Manag. 43(6), 1643–1662 (2007)
Wu, C.-W., Liu, C.-L.: Ontology-based text summarization for business news articles. Comput. Appl. 2003, 389–392 (2003)
Hennig, L., Umbrath, W., Wetzker, R.: An ontology-based approach to text summarization. In: 3rd Workshop on Natural Language Processing and Ontology Engineering (NLPOE), pp. 291–294, Toronto, Canada (2008)
Tosta, F.E.S.: Aplicação de conhecimento léxico-conceitual na Sumarização Multidocumento Multilíngue. 2013. Dissertação (Mestrado em Linguística)–Departamento de Letras, Universidade Federal de São Carlos (2014)
Fellbaum, C. (ed.): WordNet: an Electronic Lexical Database (Language, Speech and Communication). MIT Press, Massachusetts (1998)
Hatzivassiloglou, V., Klavans, J.L., Holcombe, M.: Simfinder: a flexible clustering tool for summarization. In: NAACL Automatic Summarization Workshop, p. 9, Pittsburgh (2001)
Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Workshop on Text Summarization Branches Out (2004)
Kumar, Y.J., Salim, N.: Automatic multi-document summarization approaches. J. Comput. Sci. 8(1), 133–140 (2012). ISSN 1549-3636
Dang, H.T.: Overview of DUC 2005. In: Document Understanding Conference (2005)
Ratnaparkhi, A.: A maximum entropy model for part-of-speech tagging. In: Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA (1996)
Sparck-Jones, K.: Automatic summarizing: factors and directions. In: Mani, I., Maybury, M.T. (eds.) Advances in Automatic Text Summarization, pp. 1–14. MIT Press, MA (1999)
Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: International Conference on New Methods in Language Processing, pp. 44–49, Manchester, UK (1994)
Nóbrega, F.A.A.: Desambiguação lexical de sentidos para o português por meio de uma abordagem multilíngue mono e multidocumento. Dissertação (Mestrado em Ciências de Computação e Matemática Computacional) - ICMC, USP, São Carlos (2013)
Nóbrega, F.A.A., Pardo, T.A.S.: General purpose word sense disambiguation methods for nouns in Portuguese. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A., Volpe Nunes, Md.G. (eds.) PROPOR 2014. LNCS, vol. 8775, pp. 94–101. Springer, Heidelberg (2014)
Pedersen, T., Kolhatkar, V.: WordNet::SenseRelate::AllWords - a broad coverage word sense tagger that maximizes semantic relatedness. In: North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference, pp. 17–20, Boulder, Colorado (2009)
Acknowledgments
We thank the Brazilian National Council for Scientific and Technological Development (CNPq) (#483231/2012-6), the State of São Paulo Research Foundation (FAPESP) (#2012/13246-5, #2015/17841-3), and the Coordination for the Improvement of Higher Education Personnel (CAPES) for the financial support.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Di Felippo, A., Tosta, F.E.S., Pardo, T.A.S. (2016). Applying Lexical-Conceptual Knowledge for Multilingual Multi-document Summarization. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_4
DOI: https://doi.org/10.1007/978-3-319-41552-9_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41551-2
Online ISBN: 978-3-319-41552-9
eBook Packages: Computer Science (R0)