Abstract
In this paper, we address the task of identifying tweets that mention books (TMB) among tweets that contain strings identical to full book titles. Although this task can be treated as a kind of Named Entity Recognition, it is harder because book titles consist of ordinary expressions (such as “The Girl on the Train”). Furthermore, when tweets are gathered through a dictionary-based search, those that contain full book titles are often spam. However, assuming a complete list of book titles is available (e.g. from a library union catalogue or a book store's commercial bibliographic data), the task can be solved by text classification. We therefore propose a two-step pipeline consisting of spam filtering and TMB classification, based on supervised learning with a small amount of labelled data. We selected the best-performing classifiers by comparing combinations of four established supervised learning methods with different features. Given the difficulty of the task, our pipeline performed well, achieving an F-score of about 0.7.
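As an illustration, the two-step pipeline described in the abstract could be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the abstract does not name the classifiers or features that were selected, so a small multinomial Naive Bayes over lowercase word tokens is used here as a stand-in. The label names (`spam`, `ham`, `tmb`, `other`), the helper names, and the toy training tweets are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens; a deliberately simple, assumed feature set."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for word in tokenize(text):
                if word in self.vocab:  # words unseen in training are skipped
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def contains_title(tweet, titles):
    """Dictionary-based candidate search: case-insensitive full-title match."""
    lowered = tweet.lower()
    return any(title.lower() in lowered for title in titles)


def identify_tmb(tweets, titles, spam_clf, tmb_clf):
    """Step 1 drops spam; step 2 keeps tweets classified as mentioning a book."""
    candidates = [t for t in tweets if contains_title(t, titles)]
    non_spam = [t for t in candidates if spam_clf.predict(t) == "ham"]
    return [t for t in non_spam if tmb_clf.predict(t) == "tmb"]


def f_score(predicted, gold):
    """Balanced F-score (F1) between predicted and gold sets of tweets."""
    tp = len(set(predicted) & set(gold))
    if tp == 0:
        return 0.0
    precision = tp / len(set(predicted))
    recall = tp / len(set(gold))
    return 2 * precision * recall / (precision + recall)


# Toy labelled data (hypothetical), standing in for the small labelled set.
titles = ["The Girl on the Train"]
spam_clf = NaiveBayes().fit(
    ["buy cheap followers now click here",
     "free download the girl on the train pdf click here",
     "just finished reading the girl on the train",
     "the girl on the train was late again today"],
    ["spam", "spam", "ham", "ham"],
)
tmb_clf = NaiveBayes().fit(
    ["just finished reading the girl on the train great novel",
     "loved the ending of the girl on the train what a book",
     "saw the girl on the train platform this morning",
     "the girl on the train next to me is asleep"],
    ["tmb", "tmb", "other", "other"],
)
tweets = [
    "Click here for free The Girl on the Train download",
    "Finished reading The Girl on the Train, great book",
    "The girl on the train platform was asleep",
    "Lovely weather today",
]
result = identify_tmb(tweets, titles, spam_clf, tmb_clf)
print(result)
```

Running spam filtering before TMB classification means the second classifier only has to separate book mentions from ordinary uses of the same string, which mirrors the ordering given in the abstract. The `f_score` helper corresponds to the evaluation measure the abstract reports, assuming the balanced F-score is meant.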
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Yada, S., Kageura, K. (2015). Identification of Tweets that Mention Books: An Experimental Comparison of Machine Learning Methods. In: Allen, R., Hunter, J., Zeng, M. (eds) Digital Libraries: Providing Quality Information. ICADL 2015. Lecture Notes in Computer Science, vol 9469. Springer, Cham. https://doi.org/10.1007/978-3-319-27974-9_30
DOI: https://doi.org/10.1007/978-3-319-27974-9_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27973-2
Online ISBN: 978-3-319-27974-9
eBook Packages: Computer Science, Computer Science (R0)