Summarization-Based Ensemble Learning Method for Sparsely Labeled Webpage Classification

  • Conference paper
Web Technologies and Applications (APWeb 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9461)

Abstract

The sparsely labeled problem has recently arisen in the field of webpage classification. In this paper, we propose a summarization-based ensemble learning method to handle it. A directory compression strategy reduces the number of webpages to be crawled, so that existing knowledge can accumulate to guide the subsequent classification process. A smart summarization strategy is then applied to obtain concise webpage content, which overcomes the impact of noise and yields more relevant text features. Experiments on a public dataset show that the proposed method achieves competitive classification results when the dataset is sparsely labeled.

Corresponding author

Correspondence to Le Li.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, L., Xu, J., Tong, H., Xiao, W. (2015). Summarization-Based Ensemble Learning Method for Sparsely Labeled Webpage Classification. In: Cai, R., Chen, K., Hong, L., Yang, X., Zhang, R., Zou, L. (eds) Web Technologies and Applications. APWeb 2015. Lecture Notes in Computer Science, vol 9461. Springer, Cham. https://doi.org/10.1007/978-3-319-28121-6_9

  • DOI: https://doi.org/10.1007/978-3-319-28121-6_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28120-9

  • Online ISBN: 978-3-319-28121-6

  • eBook Packages: Computer Science, Computer Science (R0)
