Cleaning Out Web Spam by Entropy-Based Cascade Outlier Detection

Wei, Sha; Zhu, Yan

doi:10.1007/978-3-319-64471-4_19


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10439)

Included in the following conference series:

International Conference on Database and Expert Systems Applications


Abstract

Web spam refers to Web pages that use tricks to mislead search engines into ranking them higher than they deserve. It causes serious damage to e-commerce and Web users and threatens Web security, so combating it is an urgent task. In this paper, Web quality and semantic measurements are integrated with content and link features to construct a more representative feature set. A cascade detection mechanism based on an entropy-based outlier mining (EOM) algorithm is proposed. The mechanism consists of three stages, each using a different feature group. Experiments on WEBSPAM-UK2007 show that the quality and semantic features effectively improve detection, and that the EOM algorithm outperforms many classic classification algorithms when the data are imbalanced. The cascade detection mechanism can clean out more spam.
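
As a rough illustration of cascading an entropy-style outlier score over several feature groups, the following Python sketch may help. It is not the EOM algorithm from the paper, whose details the abstract does not give. It assumes the features are already discretized and scores each page by its summed surprisal (-log2 of the relative frequency of each feature value). The function names, the feature groups, and the threshold are hypothetical and chosen only for the toy data.

```python
import math
from collections import Counter


def surprisal_scores(rows, features):
    """Score each row by the summed surprisal -log2 p(value) over `features`.

    Rare feature values give high scores. This is a generic stand-in for an
    entropy-based outlier score, not the paper's EOM algorithm.
    """
    n = len(rows)
    counts = {f: Counter(r[f] for r in rows) for f in features}
    return [sum(-math.log2(counts[f][r[f]] / n) for f in features) for r in rows]


def cascade_detect(rows, feature_groups, threshold):
    """Run one outlier pass per feature group (hypothetical cascade).

    Rows flagged at a stage leave the pool; the rest go to the next stage.
    Returns (flagged_indices, remaining_indices).
    """
    remaining = list(range(len(rows)))
    flagged = []
    for group in feature_groups:
        if not remaining:
            break
        pool = [rows[i] for i in remaining]
        scores = surprisal_scores(pool, group)
        keep = []
        for idx, score in zip(remaining, scores):
            if score > threshold:
                flagged.append(idx)
            else:
                keep.append(idx)
        remaining = keep
    return flagged, remaining


# Toy data: six ordinary pages plus two pages with one rare value each.
rows = [{"content": "low", "link": "low", "quality": "ok"} for _ in range(6)]
rows.append({"content": "low", "link": "high", "quality": "ok"})   # index 6
rows.append({"content": "high", "link": "low", "quality": "ok"})   # index 7

# Three stages, one assumed feature group per stage.
groups = [["content"], ["link"], ["quality"]]
flagged, remaining = cascade_detect(rows, groups, threshold=2.0)
```

On this toy data the first stage flags the page with the rare content value and the second stage flags the page with the rare link value. Because each stage recomputes value frequencies on the pages that remain, an outlier removed early no longer influences the scores of later stages.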

Acknowledgements

This work was supported by the Academic and Technological Leadership Training Foundation of Sichuan Province, China [WZ0100112371601/004, WZ0100112371408, YH1500411031402].

Author information

Authors and Affiliations

School of Information Science and Technology, Southwest Jiaotong University, Chengdu, 610031, China
Sha Wei & Yan Zhu

Authors

Sha Wei
Yan Zhu

Corresponding author

Correspondence to Yan Zhu.

Editor information

Editors and Affiliations

University of Lyon, Villeurbanne, France
Djamal Benslimane
University of Milan, Milan, Italy
Ernesto Damiani
University of Michigan, Dearborn, Michigan, USA
William I. Grosky
Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
Wright State University, Dayton, Ohio, USA
Amit Sheth
Johannes Kepler University, Linz, Austria
Roland R. Wagner

About this paper

Cite this paper

Wei, S., Zhu, Y. (2017). Cleaning Out Web Spam by Entropy-Based Cascade Outlier Detection. In: Benslimane, D., Damiani, E., Grosky, W., Hameurlain, A., Sheth, A., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2017. Lecture Notes in Computer Science, vol 10439. Springer, Cham. https://doi.org/10.1007/978-3-319-64471-4_19

DOI: https://doi.org/10.1007/978-3-319-64471-4_19
Published: 02 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64470-7
Online ISBN: 978-3-319-64471-4
eBook Packages: Computer Science, Computer Science (R0)
