Abstract
In this paper, we explore whether there exist ‘contrast’ features that help recognize whether a text variety is a genre or a domain. We carry out our experiments on the text varieties included in the Swedish national corpus, the Stockholm-Umeå Corpus (SUC), and build several text classification models based on text complexity features, grammatical features, bag-of-words features and word embeddings. Results show that text complexity features and grammatical features systematically perform better on genres than on domains. This indicates that these features can be used as ‘contrast’ features: when the nature of a text category is in doubt, they help bring it to light.
Notes
- 1.
- 2.
- 3.
- 4.
The Weighted Averaged F-Measure is the sum of the per-class F-measures, each weighted according to the number of instances with that particular class label. It is a more reliable metric than the harmonic F-measure (F1).
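As an illustration of the definition in this note, the following is a minimal sketch of a support-weighted F-measure in plain Python. It is not the authors' implementation: the function names (`per_class_f1`, `weighted_f_measure`) and the toy labels are hypothetical, and the sketch assumes that each class weight is its instance count divided by the total number of instances.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
    return 2 * tp / denom if denom else 0.0


def weighted_f_measure(y_true, y_pred):
    """Sum of per-class F1 scores, each weighted by the class's share of instances.

    Assumption: weights are taken from the true labels, so a label that only
    appears among the predictions has zero weight.
    """
    total = len(y_true)
    if total == 0:
        return 0.0
    support = Counter(y_true)
    return sum(
        per_class_f1(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Hypothetical toy data: four instances of one class, one of another.
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "b", "b"]
score = weighted_f_measure(y_true, y_pred)
```

On this toy data the per-class scores are 6/7 for the larger class and 2/3 for the smaller one, so the weighted score (86/105) sits closer to the larger class's score than the unweighted mean of the two would.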
Acknowledgements
This research was supported by E-care@home, a “SIDUS – Strong Distributed Research Environment” project, funded by the Swedish Knowledge Foundation [kk-stiftelsen, Diarienr: 20140217]. Project website: http://ecareathome.se/
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Santini, M., Danielsson, B., Jönsson, A. (2019). Introducing the Notion of ‘Contrast’ Features for Language Technology. In: Anderst-Kotsis, G., et al. Database and Expert Systems Applications. DEXA 2019. Communications in Computer and Information Science, vol 1062. Springer, Cham. https://doi.org/10.1007/978-3-030-27684-3_24
DOI: https://doi.org/10.1007/978-3-030-27684-3_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27683-6
Online ISBN: 978-3-030-27684-3
eBook Packages: Computer Science, Computer Science (R0)