Abstract
Search Engine Optimization (SEO) is a set of techniques that help website operators increase the visibility of their webpages to search engine users. However, many unethical practices, collectively called blackhat SEO, abuse search engines' ranking algorithms to promote illegal online content. In this paper, we make the first attempt to systematically investigate a recent trend in blackhat SEO, semantic confusion, which mingles the content of a webpage to evade existing blackhat SEO detection. In particular, from the new perspective of content semantics, we propose an effective defense against blackhat SEO based on semantic confusion. We built a prototype of our defense, called SCDS, and validated its effectiveness on 4.5 million domains randomly selected from 11 zone files and passive DNS records. Our evaluation results show that SCDS can detect more than 82 thousand blackhat SEO websites with a precision of 98.35%. We further analyzed 57,477 long-tail keywords promoted by blackhat SEO and found more than 157 SEO campaigns. Finally, we deployed SCDS at the gateway of a campus network for ten months and detected 23,093 domains with malicious semantic confusion content, showing the effectiveness of SCDS in practice.
Notes
1. A dataset that has been widely used in text-related studies (http://thuctc.thunlp.org/ [in Chinese]). Note that although the dataset itself was compiled from news pages published from 2005 to 2011, the semantics of the language remain largely stable, so the accuracy of text classification still holds.
2. http://magnet-uri.sourceforge.net/, a URI scheme used in P2P file sharing that allows resources to be referred to without an available host.
3. https://www.bittorrent.com/, a popular P2P file-sharing tool based on the distributed hash table (DHT) method.
4. QiAnXin is a leading cybersecurity company in China (https://en.qianxin.com).
Appendix
A Practices of Semantic Confusion
We further analyzed how mingled semantics are embedded in SEO pages and identified three main categories: (1) Modify only the <title> tag and keep all other parts unchanged. Search engines often pay more attention to the <title> tag, so text in the title has a higher probability of being indexed. In addition, blackhat SEOers want to avoid the detection that page modifications or mixed illegal content could trigger. (2) Embed the same promotion keywords repeatedly in different paragraphs. A moderate amount of repetition helps search engines extract the keywords and rank them higher. These first two methods mainly target search engines. (3) Replace an entire paragraph with promotion content. This category mainly targets visitors and aims to attract them as soon as they arrive at the webpage. The replaced content is short-lived because search engines and webmasters notice it easily.
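To make the three categories concrete, the following is a minimal sketch of how a page could be sorted into them with simple keyword heuristics. It is not the SCDS pipeline, which works on content semantics. The class `PageText`, the function `classify_embedding`, the whitespace tokenization, the single-word keywords, and the 0.5 density threshold are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def classify_embedding(html, promo_keywords, density_threshold=0.5):
    """Return the set of categories (1, 2, 3) that a page appears to match.

    Assumes single-word keywords and whitespace-separated text; the
    threshold is a hypothetical value, not one taken from the paper.
    """
    parser = PageText()
    parser.feed(html)
    title = parser.title.lower()
    paragraphs = [p.lower() for p in parser.paragraphs]
    keywords = [k.lower() for k in promo_keywords]

    # Number of keyword occurrences in each paragraph.
    hits = [sum(p.count(k) for k in keywords) for p in paragraphs]

    categories = set()
    # (1) Keywords appear in the title while the body is left untouched.
    if any(k in title for k in keywords) and sum(hits) == 0:
        categories.add(1)
    # (2) Keywords are repeated across at least two different paragraphs.
    if sum(1 for h in hits if h > 0) >= 2:
        categories.add(2)
    # (3) Some paragraph consists mostly of promotion keywords.
    for text, h in zip(paragraphs, hits):
        words = text.split()
        if words and h / len(words) >= density_threshold:
            categories.add(3)
    return categories


if __name__ == "__main__":
    page = (
        "<html><head><title>Cheap casino bonus</title></head><body>"
        "<p>Our school was founded in 1950.</p>"
        "<p>Contact the office for details.</p></body></html>"
    )
    print(classify_embedding(page, ["casino"]))
```

Chinese pages would need a word segmenter in place of `str.split`, and a realistic detector would compare the topics of the title and body rather than match a fixed keyword list.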
B Keyword Promotion in Other Platforms
GitHub. In our study, we found that some blackhat SEOers promote GitHub repositories that offer gambling software development services. Specifically, when we searched “github.com+[gambling keywords]” in Google, the results showed many GitHub repositories introducing gambling software, and the descriptions of these repositories included the developer’s contact information (e.g., phone numbers and IM IDs). Another practice is to place a large number of gambling keywords in a repository’s introduction so that search engines index them. For example, Fig. 12 shows the Google search results for a gambling keyword “ ” (a dark jargon that means “Live Dealer Casino Games” in the underground gambling business) combined with “github.com”. The top results are mostly GitHub repositories that promote gambling development services.
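The query pattern “github.com+[gambling keywords]” can be generated for a whole keyword list with a few lines of code. The sketch below only builds and URL-encodes the query strings; the helper names `build_site_queries` and `to_query_string` and the `q` parameter name are assumptions for illustration, and the keyword `pk10` is used as a placeholder.

```python
from urllib.parse import quote_plus


def build_site_queries(site, keywords):
    """Pair a site name with each keyword, e.g. "github.com pk10"."""
    return [f"{site} {keyword}" for keyword in keywords]


def to_query_string(query, param="q"):
    """URL-encode one query; spaces become "+" as in the pattern above."""
    return f"{param}={quote_plus(query)}"


if __name__ == "__main__":
    for query in build_site_queries("github.com", ["pk10"]):
        print(to_query_string(query))
```

Non-ASCII keywords, such as Chinese dark jargon, are percent-encoded as UTF-8 by `quote_plus`, so the same helpers apply to them unchanged.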
E-Commerce. We also found that the keywords promoted by blackhat SEO pages were not only used in search engines but also appeared on E-Commerce websites. For example, when we searched for the most frequent keyword, pk10, on jd.com (a well-known E-Commerce website in China), the results included shops that provide illegal gambling software development services, as shown in Fig. 13. Therefore, we recommend that E-Commerce websites also identify and remove such underground-business activities from their search results.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, H. et al. (2021). Mingling of Clear and Muddy Water: Understanding and Detecting Semantic Confusion in Blackhat SEO. In: Bertino, E., Shulman, H., Waidner, M. (eds) Computer Security – ESORICS 2021. ESORICS 2021. Lecture Notes in Computer Science(), vol 12972. Springer, Cham. https://doi.org/10.1007/978-3-030-88418-5_13
DOI: https://doi.org/10.1007/978-3-030-88418-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88417-8
Online ISBN: 978-3-030-88418-5
eBook Packages: Computer Science (R0)