Near-Duplicate Detection for Online-Shops Owners: An FCA-Based Approach

Ignatov, Dmitry I.; Konstantiov, Andrey V.; Chubis, Yana

doi:10.1007/978-3-642-36973-5_69

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7814)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

We propose a prototype of a near-duplicate detection system for web-shop owners. Online businesses of this kind typically buy descriptions of their goods from so-called copywriters. A copywriter may occasionally cheat and supply the owner with almost identical descriptions for different items. In this paper we demonstrate how Formal Concept Analysis (FCA) can be used for fast clustering and for revealing such duplicates in datasets from a real online perfume shop.
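To make the idea concrete, the following is a minimal Python sketch of how FCA-style clustering might surface near-duplicate descriptions. It is not the authors' implementation: the word-shingle representation, the function names, the sample texts, and the parameters `k` and `min_shared` are all assumptions made for illustration. Documents are treated as objects and shingles as attributes; for each pair of documents the sketch takes their common shingles (an intent) and closes it to the set of all documents containing those shingles (an extent). Only concepts generated by pairs are enumerated, not the full concept lattice.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def common_shingles(docs, ids):
    """Derivation operator: shingles shared by all documents in ids."""
    sets = [docs[i] for i in ids]
    return set.intersection(*sets) if sets else set()


def docs_containing(docs, attrs):
    """Dual derivation operator: documents that contain every shingle in attrs."""
    return frozenset(name for name, s in docs.items() if attrs <= s)


def near_duplicate_clusters(texts, k=3, min_shared=3):
    """Return extents of pair-generated concepts sharing >= min_shared shingles.

    k and min_shared are hypothetical tuning parameters, not values from the paper.
    """
    docs = {name: shingles(t, k) for name, t in texts.items()}
    clusters = set()
    for a, b in combinations(docs, 2):
        intent = common_shingles(docs, (a, b))
        if len(intent) >= min_shared:
            # Closing the intent yields every document sharing those shingles.
            clusters.add(docs_containing(docs, intent))
    return clusters


# Made-up product descriptions, for illustration only.
texts = {
    "p1": "fresh citrus scent with notes of bergamot and white musk",
    "p2": "fresh citrus scent with notes of bergamot and soft amber",
    "p3": "deep woody fragrance built on cedar and vetiver",
}
clusters = near_duplicate_clusters(texts)
```

Here `p1` and `p2` share six three-word shingles, so they end up in one cluster, while `p3` shares none with either and is left out. On a real catalogue the pairwise loop would be too slow; a closed-itemset miner would be the natural replacement, under the same object-attribute view.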

The study was implemented in the framework of the Basic Research Program at the National Research University Higher School of Economics (IS&SA Lab).

Author information

Authors and Affiliations

National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov, Andrey V. Konstantiov & Yana Chubis

Editor information

Editors and Affiliations

Yandex, Leo Tolstoy, 16, 119021, Moscow, Russia
Pavel Serdyukov & Ilya Segalovich
Kontur Labs and Ural Federal University, Fonvizina 3-27, 620078, Yekaterinburg, Russia
Pavel Braslavski
National Research University Higher School of Economics (HSE), Pokrovskii bd 11, 109028, Moscow, Russia
Sergei O. Kuznetsov
University of Amsterdam, Turfdraagsterpad 9, 1012 XT, Amsterdam, The Netherlands
Jaap Kamps
Knowledge Media Institute, The Open University, Walton Hall, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Mathematics & Computer Science Department, Emory University, 400 Dowman Drive, 30329, Atlanta, GA, USA
Eugene Agichtein
Department of Computer Science, University College London, Gower Street, WC1E 6BT, London, UK
Emine Yilmaz

About this paper

Cite this paper

Ignatov, D.I., Konstantiov, A.V., Chubis, Y. (2013). Near-Duplicate Detection for Online-Shops Owners: An FCA-Based Approach. In: Serdyukov, P., et al. Advances in Information Retrieval. ECIR 2013. Lecture Notes in Computer Science, vol 7814. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36973-5_69

DOI: https://doi.org/10.1007/978-3-642-36973-5_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36972-8
Online ISBN: 978-3-642-36973-5
eBook Packages: Computer Science, Computer Science (R0)
