Clustering a Very Large Number of Textual Unstructured Customers’ Reviews in English

Conference paper in Artificial Intelligence: Methodology, Systems, and Applications (AIMSA 2012)

Abstract

When a very large volume of unstructured text documents expresses different opinions and it is not known which document belongs to which category, clustering can help reveal the classes. The presented research dealt with almost two million opinions concerning customers' (dis)satisfaction with hotel services all over the world. The experiments investigated the automatic building of clusters representing positive and negative opinions. For the given high-dimensional sparse data, the aim was to find a clustering algorithm together with its best parameters: the similarity and clustering-criterion functions, the word representation, and the role of stemming. Because each review was labeled as belonging to the positive or negative class, it was possible to verify the efficiency of the various algorithms and parameters. From the entropy viewpoint, the best results were obtained with k-means using the binary representation with cosine similarity, idf weighting, and the H2 criterion function, while stemming played no role.
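
The abstract describes the full experimental pipeline: a binary word representation weighted by idf, cosine similarity, k-means clustering, and entropy computed against the known positive/negative labels. Below is a minimal sketch of such a pipeline in Python with scikit-learn; it is not the authors' implementation, and the sample texts, labels, and the choice of two clusters are illustrative assumptions only.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_entropy(labels_true, labels_pred):
    # Weighted average entropy of the true classes inside each cluster
    # (lower is better; 0 means every cluster is pure).
    total = len(labels_true)
    entropy = 0.0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        probs = np.bincount(members) / len(members)
        probs = probs[probs > 0]
        entropy += (len(members) / total) * -(probs * np.log2(probs)).sum()
    return entropy

# Illustrative data only: review texts and known 0/1 sentiment labels.
texts = ["the room was clean and the staff friendly",
         "rude reception and a dirty bathroom"]
labels = np.array([1, 0])

# binary=True keeps only term presence, use_idf=True applies idf weighting,
# and the default L2 normalization makes Euclidean k-means behave like
# cosine-based (spherical) k-means on these vectors.
vectorizer = TfidfVectorizer(binary=True, use_idf=True, norm="l2")
X = vectorizer.fit_transform(texts)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster entropy:", cluster_entropy(labels, km.labels_))

At the scale reported in the paper (millions of reviews), the same weighting and normalization choices apply, but a mini-batch variant of k-means and sparse matrices become essential for tractability.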

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Žižka, J., Burda, K., Dařena, F. (2012). Clustering a Very Large Number of Textual Unstructured Customers’ Reviews in English. In: Ramsay, A., Agre, G. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2012. Lecture Notes in Computer Science(), vol 7557. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33185-5_5

  • DOI: https://doi.org/10.1007/978-3-642-33185-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33184-8

  • Online ISBN: 978-3-642-33185-5

  • eBook Packages: Computer Science, Computer Science (R0)
