Applying Lexical-Conceptual Knowledge for Multilingual Multi-document Summarization

  • Conference paper
Computational Processing of the Portuguese Language (PROPOR 2016)

Abstract

We define Multilingual Multi-Document Summarization (MMDS) as the process of identifying the main information of a cluster with (at least) two texts, one in the user’s language and one in a foreign language, and presenting it as a summary in the user’s language. Although the task is relevant given the increasing amount of on-line information in different languages, there are only baselines for (Brazilian) Portuguese, which apply machine translation to obtain a monolingual input and use superficial features for sentence extraction. We report our investigation into applying a conceptual frequency measure to build a summary in Portuguese from a bilingual cluster (Portuguese and English). The methods tackle two additional challenges: using Princeton WordNet for noun annotation and applying MT to translate selected English sentences into Portuguese. The experiments were performed on a corpus of 20 clusters and show that lexical-conceptual knowledge improves the linguistic quality and informativeness of extracts.
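To make the idea of a conceptual frequency measure concrete, the following is a minimal sketch, not the authors' implementation. It assumes each sentence has already been annotated with concept identifiers (here plain strings such as `"police"`, which are hypothetical stand-ins for WordNet synsets), and it assumes a sentence's score is the sum of the cluster-wide frequencies of its concepts. The function name `rank_by_concept_frequency` and the scoring formula are illustrative assumptions.

```python
from collections import Counter


def rank_by_concept_frequency(sentences):
    """Rank sentences by how frequent their concepts are in the whole cluster.

    `sentences` is a list of (text, concept_ids) pairs. The concept ids are
    hypothetical labels standing in for the synsets attached to nouns.
    """
    # Count every concept occurrence across all sentences of the cluster,
    # regardless of the language of the sentence it came from.
    frequency = Counter(c for _, concepts in sentences for c in concepts)

    scored = []
    for index, (text, concepts) in enumerate(sentences):
        # Assumption: the sentence score is the sum of its concepts' frequencies.
        score = sum(frequency[c] for c in concepts)
        scored.append((score, index, text))

    # Highest score first; ties keep the original sentence order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(text, score) for score, _, text in scored]


cluster = [
    ("The police arrested a suspect.", ["police", "suspect"]),
    ("A polícia prendeu o suspeito.", ["police", "suspect"]),
    ("The weather was mild.", ["weather"]),
]
for text, score in rank_by_concept_frequency(cluster):
    print(score, text)
```

Because the English and Portuguese sentences share the same concept labels, they reinforce each other's scores, which is the intuition behind using a language-independent conceptual layer for a bilingual cluster.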

Notes

  1. Summarization technique that involves ranking sentences using some scoring mechanism, picking the top-scoring sentences, and concatenating them in a certain order to build the summary [5].

  2. http://www.nilc.icmc.usp.br/nilc/index.php/team?id=23#resource.

  3. A semantic network of English in which the meanings of word forms and expressions of the noun, verb, adjective, and adverb classes are organized into “sets of synonyms” (synsets). Each synset expresses a distinct concept/sense, and the synsets are interlinked through conceptual-semantic (i.e., hyponymy, meronymy, entailment, and cause) and lexical (i.e., antonymy) relations [11].

  4. The nouns are grouped into concept sets using WN.Pr synsets and the hyponymy relation. To build a set, the highly polysemous nouns are not disambiguated, but replaced by others that are strongly related to the same verb (e.g., “officer” is replaced by “policeman” due to the relation with “arrest”). Once the sets are built, the sentence ranking is based on concept frequency [6].

  5. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) computes the number of n-grams shared by the automatic and reference/human summaries, and is able to rank automatic summaries as well as humans would do, as its author has shown [13].

  6. In this method, the first sentence from each text in the cluster is selected until a maximum number of words/bytes is reached; if the first sentence of every text in the set has already been included, the second sentence from each text is included in the summary, and so on [3].

  7. http://newsblaster.cs.columbia.edu/.

  8. http://translate.google.com/.

  9. This “ontology” consists of a generalization/specialization hierarchy of concepts (i.e., a taxonomy).

  10. Summaries that contain some degree of paraphrase of the input.

  11. http://www.icmc.usp.br/pessoas/taspardo/sucinto/resources.html.

  12. We thank Fernando A. A. Nóbrega for helping to adapt the tool.

  13. http://www.wordreference.com/.

  14. We have excluded other resources (e.g., Google Translation) because of use/license limitations.
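The first-sentence baseline of Note 6 and the n-gram overlap behind ROUGE in Note 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the official ROUGE package and not the systems cited above: the function names are hypothetical, tokenization is plain whitespace splitting, the length budget is counted in words, the baseline stops at the first sentence that would exceed the budget, and the recall is computed against a single reference summary.

```python
from collections import Counter


def first_sentence_baseline(texts, max_words):
    """Pick the 1st sentence of each text, then the 2nd of each, and so on.

    `texts` is a list of documents, each given as a list of sentence strings.
    """
    summary, used = [], 0
    longest = max((len(sentences) for sentences in texts), default=0)
    for position in range(longest):
        for sentences in texts:
            if position >= len(sentences):
                continue  # this text has no sentence at this position
            n_words = len(sentences[position].split())
            # Assumption: stop as soon as a sentence would exceed the budget.
            if used + n_words > max_words:
                return summary
            summary.append(sentences[position])
            used += n_words
    return summary


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference n-grams that also occur in the candidate."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match so a repeated n-gram is not counted more often
    # than it appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


texts = [["a b c", "d e"], ["f g", "h i j"]]
print(first_sentence_baseline(texts, max_words=6))
print(rouge_n_recall("the cat sat", "the cat sat down", n=1))
```

With a budget of six words, the baseline keeps the first sentence of each text and stops before the second round, since adding the next sentence would exceed the limit.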

References

  1. Evans, D.K., Klavans, J.L., McKeown, K.R.: Columbia NewsBlaster: multilingual news summarization on the web. In: North American Chapter of The Association for Computational Linguistics: Human Language Technologies, pp. 1–4, Boston (2004)

  2. Roark, B., Fisher, S.: OGI OHSU baseline multilingual multi-document summarization system. In: Multilingual Summarization Evaluation (MSE), Michigan, USA (2005)

  3. Evans, D.K., Klavans, J.L., McKeown, K.R.: Similarity-based multilingual multi-document summarization. Technical report CUCS-014-05. Columbia University, New York (2005)

  4. Tosta, F.E.S., Di-Felippo, A., Pardo, T.A.S.: Estudo de métodos clássicos de sumarização automática no cenário multidocumento multilíngue. In: 4th Workshop de IC em Tecnologia da Informação e da Linguagem Humana, pp. 34–36, Fortaleza, Brazil (2013)

  5. Mani, I.: Automatic Summarization. John Benjamins Publishing Co., Amsterdam (2004)

  6. Schiffman, B., Nenkova, A., McKeown, K.: Experiments in multi-document summarization. In: 2nd International Conference on HLT Research, pp. 52–58, San Francisco (2002)

  7. Ye, S., Chua, T.-S., Kan, M.-Y., Qiu, L.: Document concept lattice for text understanding and summarization. Inf. Process. Manag. 43(6), 1643–1662 (2007)

  8. Wu, C.-W., Liu, C.-L.: Ontology-based text summarization for business news articles. Comput. Appl. 2003, 389–392 (2003)

  9. Hennig, L., Umbrath, W., Wetzker, R.: An ontology-based approach to text summarization. In: 3rd Workshop on Natural Language Processing and Ontology Engineering (NLPOE), pp. 291–294, Toronto, Canada (2008)

  10. Tosta, F.E.S.: Aplicação de conhecimento léxico-conceitual na Sumarização Multidocumento Multilíngue. 2013. Dissertação (Mestrado em Linguística)–Departamento de Letras, Universidade Federal de São Carlos (2014)

  11. Fellbaum, C. (ed.): WordNet: an Electronic Lexical Database (Language, Speech and Communication). MIT Press, Massachusetts (1998)

  12. Hatzivassiloglou, V., Klavans, J.L., Holcombe, M.: Simfinder: a flexible clustering tool for summarization. In: NAACL Automatic Summarization Workshop, p. 9, Pittsburgh (2001)

  13. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Workshop on Text Summarization Branches Out (2004)

  14. Kumar, Y.J., Salim, N.: Automatic multi-document summarization approaches. J. Comput. Sci. 8(1), 133–140 (2012). ISSN 1549-3636

  15. Dang, H.T.: Overview of DUC 2005. In: Document Understanding Conference (2005)

  16. Ratnaparkhi, A.: A maximum entropy model for part-of-speech tagging. In: Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA (1996)

  17. Sparck-Jones, K.: Automatic summarizing: factors and directions. In: Mani, I., Maybury, M.T. (eds.) Advances in Automatic Text Summarization, pp. 1–14. MIT Press, MA (1999)

  18. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: International Conference on New Methods in Language Processing, pp. 44–49, Manchester, UK (1994)

  19. Nóbrega, F.A.A.: Desambiguação lexical de sentidos para o português por meio de uma abordagem multilíngue mono e multidocumento. Dissertação (Mestrado em Ciências de Computação e Matemática Computacional) - ICMC, USP, São Carlos (2013)

  20. Nóbrega, F.A.A., Pardo, T.A.S.: General purpose word sense disambiguation methods for nouns in Portuguese. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A., Volpe Nunes, Md.G. (eds.) PROPOR 2014. LNCS, vol. 8775, pp. 94–101. Springer, Heidelberg (2014)

  21. Pedersen, T., Kolhatkar, V.: WordNet::SenseRelate::AllWords - a broad coverage word sense tagger that maximizes semantic relatedness. In: North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference, pp. 17–20, Boulder, Colorado (2009)

Acknowledgments

We thank the Brazilian National Council for Scientific and Technological Development (CNPq) (#483231/2012-6), the State of São Paulo Research Foundation (FAPESP) (#2012/13246-5, #2015/17841-3), and the Coordination for the Improvement of Higher Education Personnel (CAPES) for the financial support.

Author information

Corresponding author

Correspondence to Thiago A. S. Pardo.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Di Felippo, A., Tosta, F.E.S., Pardo, T.A.S. (2016). Applying Lexical-Conceptual Knowledge for Multilingual Multi-document Summarization. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_4

  • DOI: https://doi.org/10.1007/978-3-319-41552-9_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41551-2

  • Online ISBN: 978-3-319-41552-9

  • eBook Packages: Computer Science, Computer Science (R0)
