Introducing the Notion of ‘Contrast’ Features for Language Technology

Santini, Marina; Danielsson, Benjamin; Jönsson, Arne

doi:10.1007/978-3-030-27684-3_24

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1062)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

In this paper, we explore whether there exist ‘contrast’ features that help recognize whether a text variety is a genre or a domain. We carry out our experiments on the text varieties included in the Swedish national corpus, the Stockholm-Umeå Corpus (SUC), and build several text classification models based on text complexity features, grammatical features, bag-of-words features and word embeddings. Results show that text complexity features and grammatical features systematically perform better on genres than on domains. This indicates that these features can be used as ‘contrast’ features: when the nature of a text category is in doubt, they help bring it to light.

Notes

1. See https://spraakbanken.gu.se/eng/resources/corpus.
2. See http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html.
3. See https://deeplearning.cms.waikato.ac.nz/.
4. The Weighted Averaged F-Measure is the sum of the per-class F-measures, each weighted according to the number of instances with that particular class label. It is a more reliable metric than the harmonic F-measure (F1).
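
The definition in Note 4 can be illustrated with a minimal Python sketch. It assumes each class weight is that class's instance count divided by the total number of instances, so the result stays between 0 and 1. The function names and the toy labels ("press", "fiction") are hypothetical and are not taken from the paper or from SUC.

```python
from collections import Counter


def per_class_f1(gold, predicted):
    """Return {label: F1}, computed one-vs-rest for every label in gold."""
    scores = {}
    for label in set(gold):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def weighted_f1(gold, predicted):
    """Sum of per-class F1 scores, each weighted by its share of instances.

    Assumption: weights are class counts divided by the total count.
    """
    support = Counter(gold)
    total = len(gold)
    f1 = per_class_f1(gold, predicted)
    return sum(f1[label] * support[label] / total for label in support)


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 scores, shown for comparison."""
    f1 = per_class_f1(gold, predicted)
    return sum(f1.values()) / len(f1)


# Hypothetical, imbalanced toy data: four "press" texts, two "fiction" texts.
gold = ["press", "press", "press", "press", "fiction", "fiction"]
predicted = ["press", "press", "press", "fiction", "fiction", "press"]

print(per_class_f1(gold, predicted))
print(weighted_f1(gold, predicted))
print(macro_f1(gold, predicted))
```

On imbalanced data such as this, the weighted average is pulled towards the score of the larger class, whereas the unweighted (macro) mean treats both classes equally.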

Acknowledgements

This research was supported by E-care@home, a “SIDUS – Strong Distributed Research Environment” project, funded by the Swedish Knowledge Foundation [kk-stiftelsen, Diarienr: 20140217]. Project website: http://ecareathome.se/

Author information

Authors and Affiliations

Division ICT-RISE SICS East, RISE Research Institutes of Sweden, Stockholm, Sweden
Marina Santini
Department of Computer and Information Science, Linköping University, Linköping, Sweden
Benjamin Danielsson & Arne Jönsson
Division ICT-RISE SICS East, RISE Research Institutes of Sweden, Linköping, Sweden
Arne Jönsson

Corresponding author

Correspondence to Marina Santini.

Editor information

Editors and Affiliations

Institute of Telecooperation, Johannes Kepler University of Linz, Linz, Oberösterreich, Austria
Gabriele Anderst-Kotsis
Software Competence Center Hagenberg, Hagenberg, Austria
A Min Tjoa
Institute of Telecooperation, Johannes Kepler University of Linz, Linz, Oberösterreich, Austria
Ismail Khalil
ENSIT, LaTICE, University of Tunis, Tunis, Tunisia
Mourad Elloumi
Software Competence Center, Hagenberg, Austria
Atif Mashkoor
Steyregg, Oberösterreich, Austria
Johannes Sametinger
Edificio 204, ICT Division,TECNALIA, Derio, Vizcaya, Spain
Xabier Larrucea
Top 74, Innsbruck, Tirol, Austria
Anna Fensel
Hagenberg Gmbh, Software Competence Center, Hagenberg im Mühlkreis, Oberösterreich, Austria
Jorge Martinez-Gil
Software Competence Center Hagenberg, SC, Hagenberg im Mühlkreis, Oberösterreich, Austria
Bernhard Moser
University of Twente, ENSCHEDE, Overijssel, The Netherlands
Christin Seifert
Bauhaus Universität Weimar, Weimar, Thüringen, Germany
Benno Stein
MiCS, Media Computer Science, University of Passau, Passau, Bayern, Germany
Michael Granitzer

About this paper

Cite this paper

Santini, M., Danielsson, B., Jönsson, A. (2019). Introducing the Notion of ‘Contrast’ Features for Language Technology. In: Anderst-Kotsis, G., et al. Database and Expert Systems Applications. DEXA 2019. Communications in Computer and Information Science, vol 1062. Springer, Cham. https://doi.org/10.1007/978-3-030-27684-3_24

DOI: https://doi.org/10.1007/978-3-030-27684-3_24
Published: 01 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27683-6
Online ISBN: 978-3-030-27684-3
eBook Packages: Computer Science, Computer Science (R0)
