Graph Model for Pattern Recognition in Text

Wu, Qin; Fuller, Eddie; Zhang, Cun-Quan

doi:10.1007/978-3-642-13422-7_1

Part of the book series: Studies in Computational Intelligence (SCI, volume 288)

Abstract

In this paper, we propose a novel approach that uses a weighted directed multigraph for text pattern recognition. Instead of the traditional model, which classifies text by keyword frequency, we set up a weighted directed multigraph model in which the distances between keywords serve as the weights of the arcs. We then develop a keyword-frequency-distance-based algorithm that uses not only the frequency of keywords but also their ordering information. We apply this idea to the detection of plagiarized papers and the detection of fraudulent emails written by the same person. The results of these case studies show that the new method performs much better than traditional methods.
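As a rough illustration of the kind of structure the abstract describes, the following Python sketch builds a weighted directed multigraph from a token sequence. It is only a sketch under stated assumptions, not the authors' algorithm: the abstract does not give the construction, so the function name `build_keyword_multigraph`, the `max_distance` window, and the rule "one arc per ordered pair of keyword occurrences, weighted by the number of token positions between them" are all hypothetical choices made for this example.

```python
from collections import defaultdict


def build_keyword_multigraph(tokens, keywords, max_distance=5):
    """Build a weighted directed multigraph over keywords.

    Returns a dict mapping (u, v) to a list of distances. Each list entry
    is one parallel arc u -> v, so repeated co-occurrences yield a multigraph.
    ASSUMPTION: arcs join keyword occurrences at most `max_distance` tokens
    apart; this windowing rule is illustrative, not taken from the chapter.
    """
    # Keep only keyword occurrences, remembering where each one appears.
    positions = [(i, t) for i, t in enumerate(tokens) if t in keywords]

    arcs = defaultdict(list)
    for a, (i, u) in enumerate(positions):
        # Look only at later occurrences, so arc direction encodes word order.
        for j, v in positions[a + 1:]:
            distance = j - i
            if distance > max_distance:
                break  # positions are sorted, so later ones are farther still
            arcs[(u, v)].append(distance)
    return dict(arcs)


tokens = "the graph model uses a weighted graph to model text".split()
keywords = {"graph", "model", "text"}
graph = build_keyword_multigraph(tokens, keywords)

for (u, v), distances in sorted(graph.items()):
    print(f"{u} -> {v}: {distances}")
```

In this representation, the number of parallel arcs between two keywords reflects how often they co-occur, the arc direction reflects their order, and the weights reflect how close together they appear, which is the combination of frequency, ordering, and distance information that the abstract refers to.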

Author information

Authors and Affiliations

Department of Mathematics, West Virginia University, Morgantown, WV, 26506-6310, USA
Qin Wu, Eddie Fuller & Cun-Quan Zhang

Editor information

Editors and Affiliations

Department of Information Management, National University of Kaohsiung, No. 700 Kaohsiung University Rd, 811, Kaohsiung, Taiwan
I-Hsien Ting, Tien-Hwa Ho & Hui-Ju Wu

About this chapter

Cite this chapter

Wu, Q., Fuller, E., Zhang, CQ. (2010). Graph Model for Pattern Recognition in Text. In: Ting, IH., Wu, HJ., Ho, TH. (eds) Mining and Analyzing Social Networks. Studies in Computational Intelligence, vol 288. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13422-7_1

DOI: https://doi.org/10.1007/978-3-642-13422-7_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13421-0
Online ISBN: 978-3-642-13422-7
eBook Packages: Engineering, Engineering (R0)
