Summary
CatS is a meta-search engine that uses text classification techniques to improve the presentation of search results. After submitting a query, the user can refine the results by browsing a category tree derived from the dmoz Open Directory topic hierarchy. This paper describes key aspects of the system (including HTML parsing, classification, and display of results), outlines the text categorization experiments performed to choose suitable classification parameters, and places the system in the context of related work on (meta-)search engines. The separate category tree extends the standard relevance list and lets the user refine the search on demand, offering an unobtrusive but potentially powerful tool for locating information quickly and efficiently. The current implementation of CatS may be considered a baseline on top of which many enhancements are possible.
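The idea of presenting classified results as a browsable category tree alongside the relevance list can be illustrated with a minimal sketch. This is not the CatS implementation: the category paths, the keyword sets, and the `classify` and `build_category_tree` functions are all hypothetical, and a simple keyword-overlap rule stands in for the trained text classifiers the paper evaluates.

```python
from collections import defaultdict

# Hypothetical keyword sets standing in for a trained text classifier.
# The category paths only mimic the shape of a topic hierarchy.
TOY_KEYWORDS = {
    "Computers/Programming": {"python", "compiler", "code"},
    "Science/Biology": {"species", "cell", "snake"},
}


def classify(snippet):
    """Return the category path with the largest keyword overlap, or None."""
    words = set(snippet.lower().split())
    best, best_overlap = None, 0
    for path, keywords in TOY_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best, best_overlap = path, overlap
    return best


def build_category_tree(results):
    """Map every category prefix to the titles filed under it, in ranked order."""
    tree = defaultdict(list)
    for result in results:
        path = classify(result["snippet"])
        if path is None:
            continue  # unclassified results remain only in the flat relevance list
        parts = path.split("/")
        # File the result under each ancestor so any tree level can be browsed.
        for depth in range(1, len(parts) + 1):
            tree["/".join(parts[:depth])].append(result["title"])
    return dict(tree)


# Made-up results for an ambiguous query such as "python".
results = [
    {"title": "Python tutorial", "snippet": "Learn to write Python code"},
    {"title": "Ball python care", "snippet": "The ball python is a snake species"},
    {"title": "Monty Python", "snippet": "British comedy troupe"},
]
tree = build_category_tree(results)
for category in sorted(tree):
    print(category, tree[category])
```

In this sketch the flat result list is left untouched and the tree is built as a separate view, which mirrors the summary's point that the category tree supplements the relevance list rather than replacing it.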
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Radovanović, M., Ivanović, M. (2006). CatS: A Classification-Powered Meta-Search Engine. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_20
DOI: https://doi.org/10.1007/3-540-33880-2_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33879-6
Online ISBN: 978-3-540-33880-2
eBook Packages: Engineering, Engineering (R0)