Abstract
The problem of sparsely labeled data has recently arisen in webpage classification. In this paper, we propose a summarization-based ensemble learning method to handle it. A directory compression strategy reduces the number of webpages to be crawled, so that existing knowledge can accumulate to guide the subsequent classification process. A smart summarization strategy is then applied to obtain concise webpage content, which reduces the impact of noise and yields more relevant text features. Experiments on a public dataset show that the proposed method achieves competitive classification results when the dataset is sparsely labeled.
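The abstract gives no algorithmic details, so the following is only a rough sketch of the three ideas it names, not the authors' method. It assumes that directory compression means keeping a few representative URLs per URL directory, that summarization is a simple frequency-based extractive step, and that the ensemble combines base classifiers by majority vote. All function names and parameters here are hypothetical.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse


def compress_by_directory(urls, per_directory=1):
    """Assumed reading of 'directory compression': keep at most
    `per_directory` URLs from each (host, directory) group."""
    groups = defaultdict(list)
    for url in urls:
        parsed = urlparse(url)
        directory = parsed.path.rsplit("/", 1)[0]  # path without its last segment
        groups[(parsed.netloc, directory)].append(url)
    return [u for group in groups.values() for u in group[:per_directory]]


def summarize(text, max_sentences=2):
    """Assumed extractive summarizer: keep the sentences whose words
    are, on average, the most frequent in the page text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return [s for s in sentences if s in top]  # restore original order


def majority_vote(predictions):
    """Assumed ensemble rule: the label predicted by most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]
```

In such a pipeline, the summaries (rather than the full, noisy page text) would be turned into features for each base classifier, and `majority_vote` would merge their per-page labels.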
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Li, L., Xu, J., Tong, H., Xiao, W. (2015). Summarization-Based Ensemble Learning Method for Sparsely Labeled Webpage Classification. In: Cai, R., Chen, K., Hong, L., Yang, X., Zhang, R., Zou, L. (eds) Web Technologies and Applications. APWeb 2015. Lecture Notes in Computer Science, vol 9461. Springer, Cham. https://doi.org/10.1007/978-3-319-28121-6_9
DOI: https://doi.org/10.1007/978-3-319-28121-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28120-9
Online ISBN: 978-3-319-28121-6
eBook Packages: Computer Science (R0)