Identification of Tweets that Mention Books: An Experimental Comparison of Machine Learning Methods

Yada, Shuntaro; Kageura, Kyo

doi:10.1007/978-3-319-27974-9_30


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9469)

Included in the following conference series:

International Conference on Asian Digital Libraries


Abstract

In this paper, we address the task of identifying tweets that mention books (TMB) among tweets that contain the same strings as full book titles. Although this task can be treated as a kind of Named Entity Recognition, it is made harder by the fact that book titles often consist of ordinary expressions (such as “The Girl on the Train”). Furthermore, when tweets are gathered through a dictionary-based search, those that contain the same strings as full book titles are often spam. However, given a complete list of book titles (for example, a library union catalogue or commercial bibliographic data from a bookstore), the task can be solved as text classification. We therefore propose a two-step pipeline, consisting of spam filtering followed by TMB classification, based on supervised learning with a small amount of labelled data. We constructed optimal classifiers by comparing combinations of four proven supervised learning methods with different features. Given the difficulty of the task, our pipeline performed well, achieving an F-score of about 0.7.
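The two-step pipeline described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the four learning methods or the features, so a small multinomial Naive Bayes over whitespace tokens stands in for both classifiers, and the training tweets, the labels ("spam", "ham", "tmb", "other"), and the helper names (`NaiveBayes`, `identify_tmb`, `f_score`) are all hypothetical. It is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumed, deliberately simple feature set)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing; a stand-in classifier."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total)
            for tok in tokenize(text):
                if tok in self.vocab:  # tokens unseen in training are ignored
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def identify_tmb(tweets, titles, spam_filter, tmb_classifier):
    """Dictionary search for full titles, then spam filtering, then TMB classification."""
    candidates = [t for t in tweets
                  if any(title.lower() in t.lower() for title in titles)]
    not_spam = [t for t in candidates if spam_filter.predict(t) == "ham"]
    return [t for t in not_spam if tmb_classifier.predict(t) == "tmb"]


def f_score(predicted, gold):
    """F1: harmonic mean of precision and recall over sets of items."""
    true_pos = len(set(predicted) & set(gold))
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(set(predicted))
    recall = true_pos / len(set(gold))
    return 2 * precision * recall / (precision + recall)


# Toy labelled data, invented for this sketch only.
spam_filter = NaiveBayes().fit(
    ["click here free download the girl on the train cheap",
     "free download click here best deals now",
     "just finished the girl on the train loved the ending",
     "saw a girl on the train with a red umbrella today"],
    ["spam", "spam", "ham", "ham"])

tmb_classifier = NaiveBayes().fit(
    ["just finished reading the girl on the train great novel",
     "the girl on the train is a gripping book i recommend reading it",
     "there was a girl on the train singing loudly this morning",
     "helped the girl on the train find her lost phone"],
    ["tmb", "tmb", "other", "other"])

titles = ["The Girl on the Train"]
tweets = [
    "Reading The Girl on the Train tonight what a great novel",
    "Click here for a free download of The Girl on the Train",
    "The girl on the train this morning was singing loudly",
    "Nice weather today",
]

identified = identify_tmb(tweets, titles, spam_filter, tmb_classifier)
print(identified)
```

On this toy data only the first tweet survives both steps: the fourth never matches a title, the second is filtered as spam, and the third matches the title string but is classified as an ordinary expression. The `f_score` helper shows how the reported metric is computed from predicted and gold sets; the toy example says nothing about the performance reported in the paper.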



Author information

Authors and Affiliations

Graduate School of Education, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
Shuntaro Yada & Kyo Kageura

Corresponding author

Correspondence to Shuntaro Yada.

Editor information

Editors and Affiliations

Yonsei University, Seoul, Korea (Republic of)
Robert B. Allen
School of ITEE, University of Queensland, St. Lucia, Queensland, Australia
Jane Hunter
School of Library & Info Sci, Kent State University, Kent, Ohio, USA
Marcia L. Zeng

About this paper

Cite this paper

Yada, S., Kageura, K. (2015). Identification of Tweets that Mention Books: An Experimental Comparison of Machine Learning Methods. In: Allen, R., Hunter, J., Zeng, M. (eds) Digital Libraries: Providing Quality Information. ICADL 2015. Lecture Notes in Computer Science, vol 9469. Springer, Cham. https://doi.org/10.1007/978-3-319-27974-9_30

DOI: https://doi.org/10.1007/978-3-319-27974-9_30
Published: 18 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27973-2
Online ISBN: 978-3-319-27974-9
eBook Packages: Computer Science, Computer Science (R0)
