Cleaning Out Web Spam by Entropy-Based Cascade Outlier Detection

  • Conference paper
Database and Expert Systems Applications (DEXA 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10439)

Abstract

Web spam refers to Web pages that use tricks to mislead search engines into ranking them higher than they deserve. It causes great damage to e-commerce and Web users and threatens Web security, so combating Web spam is an urgent task. In this paper, Web quality and semantic measurements are integrated with content and link features to construct a more representative feature set. A cascade detection mechanism based on an entropy-based outlier mining (EOM) algorithm is proposed. The mechanism consists of three stages that use different feature groups. Experiments on WEBSPAM-UK2007 show that the quality and semantic features effectively improve detection, and that the EOM algorithm outperforms many classic classification algorithms when the data are unbalanced. The cascade detection mechanism can clean out more spam.
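The abstract does not give the details of the EOM algorithm or the contents of the three feature groups, so the following is only a minimal illustrative sketch of the general idea of cascading an information-theoretic outlier score over successive feature groups. It is not the authors' method: the surprisal score, the feature names (`content`, `link`, `quality`), the toy data, and the threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def surprisal_scores(rows, features):
    """Score each row by the summed surprisal (-log2 p) of its values.

    Assumption: features are already discretized, and rare values
    (high surprisal) are treated as evidence of an outlier.
    """
    n = len(rows)
    counts = {f: Counter(r[f] for r in rows) for f in features}
    return [
        sum(-math.log2(counts[f][r[f]] / n) for f in features)
        for r in rows
    ]


def cascade_detect(rows, feature_groups, threshold):
    """Run one scoring stage per feature group.

    Rows flagged at a stage are removed before the next stage, so each
    later stage only examines pages that earlier stages did not flag.
    Returns (flagged_indices, remaining_indices).
    """
    remaining = list(range(len(rows)))
    flagged = []
    for features in feature_groups:
        subset = [rows[i] for i in remaining]
        if not subset:
            break
        scores = surprisal_scores(subset, features)
        keep = []
        for i, score in zip(remaining, scores):
            (flagged if score > threshold else keep).append(i)
        remaining = keep
    return flagged, remaining


# Hypothetical toy data: five ordinary pages plus three pages that are
# each unusual on exactly one (made-up) discretized feature.
normal = {"content": "low", "link": "normal", "quality": "good"}
pages = [dict(normal) for _ in range(5)] + [
    {**normal, "content": "high"},
    {**normal, "link": "dense"},
    {**normal, "quality": "poor"},
]

# Hypothetical three-stage cascade, one feature group per stage.
flagged, remaining = cascade_detect(
    pages, [["content"], ["link"], ["quality"]], threshold=2.0
)
print(flagged, remaining)
```

In this toy run each stage flags the one page that is rare on that stage's feature, which illustrates why a cascade over different feature groups can catch pages that a single stage would miss.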



Acknowledgements

This work was supported by the Academic and Technological Leadership Training Foundation of Sichuan Province, China [WZ0100112371601/004, WZ0100112371408, YH1500411031402].

Corresponding author

Correspondence to Yan Zhu.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Wei, S., Zhu, Y. (2017). Cleaning Out Web Spam by Entropy-Based Cascade Outlier Detection. In: Benslimane, D., Damiani, E., Grosky, W., Hameurlain, A., Sheth, A., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2017. Lecture Notes in Computer Science, vol 10439. Springer, Cham. https://doi.org/10.1007/978-3-319-64471-4_19

  • DOI: https://doi.org/10.1007/978-3-319-64471-4_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64470-7

  • Online ISBN: 978-3-319-64471-4

  • eBook Packages: Computer Science, Computer Science (R0)
