Applying Lexical-Conceptual Knowledge for Multilingual Multi-document Summarization

Di Felippo, Ariani; Tosta, Fabrício E. S.; Pardo, Thiago A. S.

doi:10.1007/978-3-319-41552-9_4

Ariani Di Felippo, Fabrício E. S. Tosta & Thiago A. S. Pardo

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9727)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language

Abstract

We define Multilingual Multi-Document Summarization (MMDS) as the process of identifying the main information of a cluster with (at least) two texts, one in the user’s language and one in a foreign language, and presenting it as a summary in the user’s language. Although the task is relevant given the increasing amount of on-line information in different languages, there are only baselines for (Brazilian) Portuguese, which apply machine translation (MT) to obtain a monolingual input and use superficial features for sentence extraction. We report our investigation of the application of a conceptual frequency measure to build a summary in Portuguese from a bilingual cluster (Portuguese and English). The methods tackle two additional challenges: using Princeton WordNet to annotate nouns and applying MT to translate the selected English sentences into Portuguese. The experiments were performed on a corpus of 20 clusters and show that lexical-conceptual knowledge improves the linguistic quality and informativeness of extracts.

Notes

1. A summarization technique that ranks sentences using some scoring mechanism, picks the top-scoring sentences, and concatenates them in a certain order to build the summary [5].
2. http://www.nilc.icmc.usp.br/nilc/index.php/team?id=23#resource.
3. A semantic network of English in which the meanings of word forms and expressions of the noun, verb, adjective, and adverb classes are organized into “sets of synonyms” (synsets). Each synset expresses a distinct concept/sense, and the synsets are interlinked through conceptual-semantic (i.e., hyponymy, meronymy, entailment, and cause) and lexical (i.e., antonymy) relations [11].
4. The nouns are grouped into concept sets using WN.Pr synsets and the hyponymy relation. To build a set, highly polysemous nouns are not disambiguated but replaced by others that are strongly related to the same verb (e.g., “officer” is replaced by “policeman” due to the relation with “arrest”). Once the sets are built, sentences are ranked by concept frequency [6].
5. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) counts the n-grams shared by the automatic and the reference (human) summaries; its author has shown that it ranks automatic summaries as well as humans would [14].
6. In this method, the first sentence of each text in the cluster is selected until a maximum number of words/bytes is reached; once the first sentence of every text has been included, the second sentence of each text is added to the summary, and so on [3].
7. http://newsblaster.cs.columbia.edu/.
8. http://translate.google.com/.
9. This “ontology” consists of a generalization/specialization hierarchy of concepts (i.e., a taxonomy).
10. Summaries that contain some degree of paraphrase of the input.
11. http://www.icmc.usp.br/pessoas/taspardo/sucinto/resources.html.
12. We thank Fernando A. A. Nóbrega for helping to adapt the tool.
13. http://www.wordreference.com/.
14. We have excluded other resources (e.g., Google Translation) because of use/license limitations.
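
The first-sentence baseline of note 6 and the n-gram counting behind ROUGE in note 5 can be sketched together as follows. This is an illustrative approximation under stated assumptions, not the official ROUGE package or the cited systems: tokenization is lowercase whitespace splitting, the budget is counted in words, the baseline stops at the first sentence that would exceed the budget, and only the recall variant of ROUGE-N with clipped counts is computed. The function names are hypothetical.

```python
from collections import Counter


def first_sentence_baseline(texts, max_words):
    """Round-robin position baseline.

    `texts` is a list of documents, each a list of sentences. Take the
    first sentence of every text, then the second of every text, and so
    on. Assumption: stop at the first sentence that would exceed the
    word budget.
    """
    summary, used = [], 0
    longest = max((len(t) for t in texts), default=0)
    for position in range(longest):
        for text in texts:
            if position >= len(text):
                continue  # this text has no sentence at this position
            n_words = len(text[position].split())
            if used + n_words > max_words:
                return summary
            summary.append(text[position])
            used += n_words
    return summary


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also occur in the candidate.

    Counts are clipped: an n-gram is matched at most as many times as
    it appears in the candidate.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())  # Counter intersection = min counts
    return overlap / total


texts = [["a b c", "d e"], ["f g", "h i j"]]
baseline = first_sentence_baseline(texts, max_words=7)
recall = rouge_n_recall("the cat sat", "the cat sat down", n=1)
```

Here `baseline` holds the first sentence of each text followed by the second sentence of the first text, and `recall` is 0.75 because three of the four reference unigrams appear in the candidate.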

References

Evans, D.K., Klavans, J.L., McKeown, K.R.: Columbia NewsBlaster: multilingual news summarization on the web. In: North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1–4, Boston (2004)
Roark, B., Fisher, S.: OGI OHSU baseline multilingual multi-document summarization system. In: Multilingual Summarization Evaluation (MSE), Michigan, USA (2005)
Evans, D.K., Klavans, J.L., McKeown, K.R.: Similarity-based multilingual multi-document summarization. Technical report CUCS-014-05. Columbia University, New York (2005)
Tosta, F.E.S., Di-Felippo, A., Pardo, T.A.S.: Estudo de métodos clássicos de sumarização automática no cenário multidocumento multilíngue. In: 4th Workshop de IC em Tecnologia da Informação e da Linguagem Humana, pp. 34–36, Fortaleza, Brazil (2013)
Mani, I.: Automatic Summarization. John Benjamins Publishing Co., Amsterdam (2004)
Schiffman, B., Nenkova, A., McKeown, K.: Experiments in multi-document summarization. In: 2nd International Conference on HLT Research, pp. 52–58, San Francisco (2002)
Ye, S., Chua, T.-S., Kan, M.-Y., Qiu, L.: Document concept lattice for text understanding and summarization. Inf. Process. Manag. 43(6), 1643–1662 (2007)
Wu, C.-W., Liu, C.-L.: Ontology-based text summarization for business news articles. Comput. Appl. 2003, 389–392 (2003)
Hennig, L., Umbrath, W., Wetzker, R.: An ontology-based approach to text summarization. In: 3rd Workshop on Natural Language Processing and Ontology Engineering (NLPOE), pp. 291–294, Toronto, Canada (2008)
Tosta, F.E.S.: Aplicação de conhecimento léxico-conceitual na Sumarização Multidocumento Multilíngue. 2013. Dissertação (Mestrado em Linguística)–Departamento de Letras, Universidade Federal de São Carlos (2014)
Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database (Language, Speech and Communication). MIT Press, Massachusetts (1998)
Hatzivassiloglou, V., Klavans, J.L., Holcombe, M.: Simfinder: a flexible clustering tool for summarization. In: NAACL Automatic Summarization Workshop, p. 9, Pittsburgh (2001)
Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Workshop on Text Summarization Branches Out (2004)
Kumar, Y.J., Salim, N.: Automatic multi-document summarization approaches. J. Comput. Sci. 8(1), 133–140 (2012). ISSN 1549-3636
Dang, H.T.: Overview of DUC 2005. In: Document Understanding Conference (2005)
Ratnaparkhi, A.: A maximum entropy model for part-of-speech tagging. In: Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA (1996)
Sparck-Jones, K.: Automatic summarizing: factors and directions. In: Mani, I., Maybury, M.T. (eds.) Advances in Automatic Text Summarization, pp. 1–14. MIT Press, MA (1999)
Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: International Conference on New Methods in Language Processing, pp. 44–49, Manchester, UK (1994)
Nóbrega, F.A.A.: Desambiguação lexical de sentidos para o português por meio de uma abordagem multilíngue mono e multidocumento. Dissertação (Mestrado em Ciências de Computação e Matemática Computacional) - ICMC, USP, São Carlos (2013)
Nóbrega, F.A.A., Pardo, T.A.S.: General purpose word sense disambiguation methods for nouns in Portuguese. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A., Volpe Nunes, Md.G. (eds.) PROPOR 2014. LNCS, vol. 8775, pp. 94–101. Springer, Heidelberg (2014)
Pedersen, T., Kolhatkar, V.: WordNet::SenseRelate::AllWords - a broad coverage word sense tagger that maximizes semantic relatedness. In: North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference, pp. 17–20, Boulder, Colorado (2009)

Acknowledgments

We thank the Brazilian National Council for Scientific and Technological Development (CNPq) (#483231/2012-6), the State of São Paulo Research Foundation (FAPESP) (#2012/13246-5, #2015/17841-3), and the Coordination for the Improvement of Higher Education Personnel (CAPES) for the financial support.

Author information

Authors and Affiliations

Interinstitutional Center for Computational Linguistics (NILC), São Carlos, SP, Brazil
Ariani Di Felippo, Fabrício E. S. Tosta & Thiago A. S. Pardo
Language and Literature Department (DL), Federal University of São Carlos (UFSCar), Rodovia Washington Luís, km 235 - SP 310, São Carlos, 13565-905, Brazil
Ariani Di Felippo
Institute of Mathematical and Computer Sciences (ICMC), University of São Paulo (USP), Avenida Trabalhador São-carlense, 400, São Carlos, 13566-590, Brazil
Thiago A. S. Pardo

Corresponding author

Correspondence to Thiago A. S. Pardo.

Editor information

Editors and Affiliations

Universidade de Lisboa, Lisboa, Portugal
João Silva
ISCTE-IUL, Lisbon, Portugal
Ricardo Ribeiro
Universidade de Évora, Évora, Portugal
Paulo Quaresma
Universidade de Caxias do Sul, Caxias do Sul, Brazil
André Adami
Universidade de Lisboa, Lisboa, Portugal
António Branco

Cite this paper

Di Felippo, A., Tosta, F.E.S., Pardo, T.A.S. (2016). Applying Lexical-Conceptual Knowledge for Multilingual Multi-document Summarization. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_4

DOI: https://doi.org/10.1007/978-3-319-41552-9_4
Published: 21 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41551-2
Online ISBN: 978-3-319-41552-9
eBook Packages: Computer Science, Computer Science (R0)
