Comparing Similarity of HTML Structures and Affiliate IDs in Splog Analysis

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6637)


Abstract

Spam blogs, or splogs, are blogs hosting spam posts created from machine-generated or hijacked content for the sole purpose of hosting advertisements or raising the number of in-links to target sites. Among splogs, this paper focuses on detecting groups of splogs that are estimated to have been created by the same spammer. We compare two clues: the similarity of the HTML structures of splogs, and affiliate IDs automatically extracted from splogs. We first show that the similarity of HTML structures is quite effective both in splog detection and in identifying spammers. We then show that matching affiliate IDs extracted from splogs can identify spammers much more directly than the similarity of HTML structures, although it is not easy to achieve high coverage in extracting affiliate IDs. Finally, we show that the coverage of the intersection of the two clues is relatively low, so the two clues need to be applied in a complementary way.
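
The abstract does not state how HTML structure similarity is measured or how affiliate IDs are extracted, so the following Python sketch is only an illustration of the two clues, not the authors' method. It assumes that structure can be approximated by the sequence of opening tags, compared with Jaccard similarity over tag trigrams, and that an affiliate ID appears as a URL query parameter with the hypothetical name `affiliate_id`. All function and class names are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class PageScanner(HTMLParser):
    """Records the opening-tag sequence and all link targets of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []    # tag names in document order, text content ignored
        self.hrefs = []   # href values of <a> elements

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def scan(html):
    """Parse an HTML string and return the filled scanner."""
    scanner = PageScanner()
    scanner.feed(html)
    return scanner


def tag_ngrams(html, n=3):
    """Set of n-grams over the opening-tag sequence (assumed structure proxy)."""
    tags = scan(html).tags
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}


def structure_similarity(html_a, html_b, n=3):
    """Jaccard similarity of the two pages' tag n-gram sets, in [0, 1]."""
    a, b = tag_ngrams(html_a, n), tag_ngrams(html_b, n)
    if not a and not b:
        return 0.0  # too few tags on both pages to compare
    return len(a & b) / len(a | b)


def affiliate_ids(html, param_names=("affiliate_id",)):
    """Values of the assumed affiliate query parameters found in page links."""
    ids = set()
    for href in scan(html).hrefs:
        query = parse_qs(urlparse(href).query)
        for name in param_names:
            ids.update(query.get(name, []))
    return ids
```

Under these assumptions, two pages built from the same template score 1.0 on `structure_similarity` regardless of their text, while `affiliate_ids` returns an empty set for any page whose links carry no recognized parameter. That second case mirrors the coverage limitation noted in the abstract and is why the two clues would be combined rather than used alone.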

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Katayama, T., Morijiri, A., Ishii, S., Utsuro, T., Kawada, Y., Fukuhara, T. (2011). Comparing Similarity of HTML Structures and Affiliate IDs in Splog Analysis. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds) Database Systems for Advanced Applications. DASFAA 2011. Lecture Notes in Computer Science, vol 6637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20244-5_36

  • DOI: https://doi.org/10.1007/978-3-642-20244-5_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20243-8

  • Online ISBN: 978-3-642-20244-5

  • eBook Packages: Computer Science, Computer Science (R0)
