Near-Duplicate Detection for Online-Shops Owners: An FCA-Based Approach

  • Conference paper
Advances in Information Retrieval (ECIR 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7814)

Abstract

We propose a prototype of a near-duplicate detection system for web-shop owners. Such online businesses typically buy descriptions of their goods from so-called copywriters. A copywriter may occasionally cheat and supply the owner with almost identical descriptions for different items. In this paper we demonstrate how Formal Concept Analysis (FCA) can be used for fast clustering and for revealing such duplicates in datasets from a real online perfume shop.
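The abstract does not spell out the pipeline, so the following is only a minimal sketch of how FCA can group near-duplicate descriptions, under stated assumptions: item descriptions are the objects of a formal context, word shingles are the attributes, and a formal concept with at least two objects and a large shared intent is treated as a candidate cluster. The function names, the shingle length `k`, the threshold `min_shared`, and the sample texts are all hypothetical, and the naive intersection-based enumeration is chosen for brevity rather than speed.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed attribute model)."""
    words = text.lower().split()
    return frozenset(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def concept_intents(context):
    """Collect every intent obtainable by intersecting object rows.

    Processing objects one by one and intersecting each row with the intents
    found so far keeps the collection closed under intersection.
    """
    intents = set()
    for row in context.values():
        intents |= {row} | {row & other for other in intents}
    return intents


def near_duplicate_clusters(descriptions, k=3, min_shared=5):
    """Return (extent, intent) pairs with >= 2 items sharing >= min_shared shingles."""
    context = {item: shingles(text, k) for item, text in descriptions.items()}
    clusters = []
    for intent in concept_intents(context):
        if len(intent) < min_shared:
            continue
        # Extent: all items whose shingle set contains the whole intent.
        extent = frozenset(item for item, row in context.items() if intent <= row)
        if len(extent) >= 2:
            clusters.append((extent, intent))
    return clusters


# Hypothetical sample data: two almost identical texts and one unrelated text.
descriptions = {
    "perfume_a": "a fresh floral scent with notes of jasmine and rose for everyday wear",
    "perfume_b": "a fresh floral scent with notes of jasmine and rose for evening wear",
    "perfume_c": "deep woody fragrance built around cedar sandalwood and a hint of smoke",
}
clusters = near_duplicate_clusters(descriptions)
for extent, intent in clusters:
    print(sorted(extent), len(intent))
```

On this sample the only reported cluster pairs the two near-identical descriptions. A production system would likely replace the naive enumeration with a dedicated closed-itemset or concept-lattice algorithm.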

The study was carried out within the framework of the Basic Research Program at the National Research University Higher School of Economics (IS&SA Lab).

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ignatov, D.I., Konstantiov, A.V., Chubis, Y. (2013). Near-Duplicate Detection for Online-Shops Owners: An FCA-Based Approach. In: Serdyukov, P., et al. Advances in Information Retrieval. ECIR 2013. Lecture Notes in Computer Science, vol 7814. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36973-5_69

  • DOI: https://doi.org/10.1007/978-3-642-36973-5_69

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36972-8

  • Online ISBN: 978-3-642-36973-5

  • eBook Packages: Computer Science, Computer Science (R0)
