Abstract
We extend the Forward Search approach for robust data analysis to problems in text mining. In this domain, a dataset is a collection of an arbitrary number of documents, each represented under the vector space model as a vector with thousands of elements. When the number of variables v is this large and the number of documents n is smaller by orders of magnitude, the sample covariance matrix is singular, so the traditional Mahalanobis distance cannot be used to measure dissimilarity between documents. We show that monitoring the cosine (dis)similarity measure with the Forward Search makes it possible to perform robust estimation for a document collection and to order the documents so that the most dissimilar ones (possible outliers for that collection) are left until the end. We also show that multiple starts of the Forward Search clearly detect the presence of several groups of documents in the collection.
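The ordering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes non-zero term-count vectors, uses the mean of the current subset as the reference vector, and the names `cosine_dissimilarity` and `forward_search_order` are hypothetical. At each step the subset grows by one document, and the smallest dissimilarity among the documents still outside the subset is recorded as the monitored quantity.

```python
import numpy as np

def cosine_dissimilarity(X, c):
    """Return 1 - cosine similarity between each row of X and the vector c.

    Assumes every row of X and the vector c are non-zero.
    """
    return 1.0 - (X @ c) / (np.linalg.norm(X, axis=1) * np.linalg.norm(c))

def forward_search_order(X, start):
    """Simplified forward search driven by cosine dissimilarity (illustrative).

    X     : (n, v) array, one document vector per row.
    start : indices of the initial subset (an assumption of this sketch;
            how the initial subset is chosen is not specified here).
    Returns the order in which documents first enter the subset and the
    monitored minimum dissimilarity among documents outside the subset.
    """
    n = X.shape[0]
    subset = np.array(sorted(start))
    entry_order = [int(i) for i in subset]
    trace = []
    while len(subset) < n:
        # Reference vector: mean of the documents currently in the subset.
        centroid = X[subset].mean(axis=0)
        d = cosine_dissimilarity(X, centroid)
        # Monitored statistic: closest document not yet in the subset.
        outside = np.setdiff1d(np.arange(n), subset)
        trace.append(float(d[outside].min()))
        # New subset: the m + 1 documents closest to the reference vector.
        subset = np.argsort(d, kind="stable")[: len(subset) + 1]
        for i in subset:
            if int(i) not in entry_order:
                entry_order.append(int(i))
    return entry_order, trace

# Toy collection: five documents sharing vocabulary, one using a different term.
X = np.array([[2, 1, 0], [1, 2, 0], [3, 3, 0],
              [1, 1, 0], [2, 2, 1], [0, 0, 5]], dtype=float)
order, trace = forward_search_order(X, start=[0, 1])
print(order, trace)
```

In this toy collection the document that shares no terms with the others enters last, and the monitored value jumps at the final step. Running the function from several different `start` subsets is one simple way to mimic the multiple-start idea mentioned above.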
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Turchi, M., Perrotta, D., Riani, M., Cerioli, A. (2013). Robustness Issues in Text Mining. In: Kruse, R., Berthold, M., Moewes, C., Gil, M., Grzegorzewski, P., Hryniewicz, O. (eds) Synergies of Soft Computing and Statistics for Intelligent Data Analysis. Advances in Intelligent Systems and Computing, vol 190. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33042-1_29
DOI: https://doi.org/10.1007/978-3-642-33042-1_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33041-4
Online ISBN: 978-3-642-33042-1
eBook Packages: Engineering, Engineering (R0)