Comparative Analysis of the Informativeness and Encyclopedic Style of the Popular Web Information Sources

Khairova, Nina; Lewoniewski, Włodzimierz; Węcel, Krzysztof; Orken, Mamyrbayev; Kuralai, Mukhsina

doi:10.1007/978-3-319-93931-5_24

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 320)

Included in the conference series: International Conference on Business Information Systems

Abstract

Decision making today often relies on information found in various Internet sources. Texts in an encyclopedic style, which contain mostly factual information, are preferred. We propose combining the logic-linguistic model with the Universal Dependencies treebank to extract facts of various quality levels from texts. Using Random Forest as the classification algorithm, we identify the types of facts and the types of words that most affect the encyclopedic style of a text. We evaluate our approach on four corpora based on Wikipedia, social and mass media texts. Our classifier achieves an F-measure of over 90%.
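To make the reported metric concrete, the following is a minimal sketch of how an F-measure can be computed for one class of a style classifier. It is not the authors' evaluation code: the function name f_measure, the labels "enc" and "other", and the toy data are all hypothetical, and the sketch assumes the balanced F1 variant (the harmonic mean of precision and recall).

```python
from typing import Sequence


def f_measure(gold: Sequence[str], predicted: Sequence[str], positive: str) -> float:
    """Return the F1 score (harmonic mean of precision and recall) for one class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    # Count true positives, false positives and false negatives for `positive`.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No correct positive prediction: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: "enc" = encyclopedic style, "other" = any other style.
gold = ["enc", "enc", "enc", "enc", "other", "other", "other", "other"]
predicted = ["enc", "enc", "enc", "other", "enc", "other", "other", "other"]
score = f_measure(gold, predicted, positive="enc")
print(f"F-measure for 'enc': {score:.2f}")  # prints: F-measure for 'enc': 0.75
```

In this toy data, three of the four texts predicted as encyclopedic are correct and three of the four truly encyclopedic texts are found, so precision and recall are both 0.75 and so is their harmonic mean.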

Notes

1. http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
2. We write ‘Predicate’, ‘Subject’ and ‘Object’ with initial capital letters to emphasize the semantic meaning of words in a sentence.
3. http://chaoticity.com/dependensee-a-dependency-parse-visualisation-tool/
4. http://universaldependencies.org/en/dep/
5. https://github.com/attardi/wikiextractor
6. A description of the groups is available at http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
7. https://www.nytimes.com/
8. https://www.theguardian.com
9. https://www.cs.waikato.ac.nz/ml/weka/index.html

Author information

Authors and Affiliations

National Technical University “Kharkiv Polytechnic Institute”, Kharkiv, Ukraine
Nina Khairova
Poznań University of Economics and Business, Poznań, Poland
Włodzimierz Lewoniewski & Krzysztof Węcel
Institute of Information and Computational Technologies, Almaty, Kazakhstan
Mamyrbayev Orken
Al-Farabi Kazakh National University, Almaty, Kazakhstan
Mukhsina Kuralai

Corresponding author

Correspondence to Nina Khairova.

Editor information

Editors and Affiliations

Poznan University of Economics and Business, Poznan, Poland
Witold Abramowicz
Fraunhofer FOKUS, Berlin, Germany
Adrian Paschke

About this paper

Cite this paper

Khairova, N., Lewoniewski, W., Węcel, K., Orken, M., Kuralai, M. (2018). Comparative Analysis of the Informativeness and Encyclopedic Style of the Popular Web Information Sources. In: Abramowicz, W., Paschke, A. (eds) Business Information Systems. BIS 2018. Lecture Notes in Business Information Processing, vol 320. Springer, Cham. https://doi.org/10.1007/978-3-319-93931-5_24

DOI: https://doi.org/10.1007/978-3-319-93931-5_24
Published: 16 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93930-8
Online ISBN: 978-3-319-93931-5
eBook Packages: Computer Science, Computer Science (R0)
