Large Scale Citation Matching Using Apache Hadoop

Fedoryszak, Mateusz; Tkaczyk, Dominika; Bolikowski, Łukasz

doi:10.1007/978-3-642-40501-3_37


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8092)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries


Abstract

Citation matching creates links from bibliography entries to the publications they reference. Such links indicate topical similarity between the linked texts, are used to assess the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. In this paper we present a citation matching method and show how to scale it up to handle large amounts of data using appropriate indexing and the MapReduce paradigm in the Hadoop environment.
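The abstract does not spell out the method, so the following is only a minimal in-memory sketch of the general idea it names: an index narrows down candidate documents for each citation, a map step scores the candidates, and a reduce step keeps the best one. Everything concrete here is an assumption made for illustration and is not taken from the paper: the token-based inverted index, the Jaccard similarity, the 0.5 threshold, and all function names. A plain dictionary stands in for the shuffle phase that Hadoop would perform.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lowercase alphanumeric tokens of length >= 3 (a crude normalisation, assumed)."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= 3}


def build_index(documents):
    """Inverted index: token -> set of ids of documents whose metadata contains it."""
    index = defaultdict(set)
    for doc_id, metadata in documents.items():
        for tok in tokens(metadata):
            index[tok].add(doc_id)
    return index


def map_citation(cit_id, citation, documents, index):
    """Map step: emit (cit_id, (doc_id, score)) for every document sharing a token."""
    cit_tokens = tokens(citation)
    candidates = set()
    for tok in cit_tokens:
        candidates |= index.get(tok, set())
    for doc_id in candidates:
        doc_tokens = tokens(documents[doc_id])
        # Jaccard similarity of the token sets (hypothetical scoring choice).
        score = len(cit_tokens & doc_tokens) / len(cit_tokens | doc_tokens)
        yield cit_id, (doc_id, score)


def reduce_citation(cit_id, scored, threshold=0.5):
    """Reduce step: keep the best-scoring candidate if it reaches the threshold."""
    best = max(scored, key=lambda pair: pair[1], default=None)
    if best is not None and best[1] >= threshold:
        return cit_id, best[0]
    return cit_id, None


def match_citations(citations, documents, threshold=0.5):
    """Run map, group by citation id (stand-in for the shuffle), then reduce."""
    index = build_index(documents)
    grouped = defaultdict(list)
    for cit_id, citation in citations.items():
        for key, value in map_citation(cit_id, citation, documents, index):
            grouped[key].append(value)
    return dict(
        reduce_citation(cit_id, grouped.get(cit_id, []), threshold)
        for cit_id in citations
    )
```

Calling `match_citations(citations, documents)` with two dictionaries that map ids to raw strings returns a dictionary from citation id to the matched document id, or `None` when no candidate reaches the threshold. The index matters for scale because each citation is compared only against documents that share at least one token with it, rather than against the whole collection.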



Author information

Authors and Affiliations

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Poland
Mateusz Fedoryszak, Dominika Tkaczyk & Łukasz Bolikowski


Editor information

Editors and Affiliations

Department of Computer and Information Science, Norwegian University of Science and Technology, 7491, Trondheim, Norway
Trond Aalberg
Department of Archives and Library Science, Ionian University, 49100, Corfu, Greece
Christos Papatheodorou
Department of Library Information and Archive Sciences, University of Malta, MSD2280, Msida, Malta
Milena Dobreva
Library and Information Center, University of Patras, 26504, Patras, Greece
Giannis Tsakonas
National Archives of Malta, RBT1043, Rabat, Malta
Charles J. Farrugia


Cite this paper

Fedoryszak, M., Tkaczyk, D., Bolikowski, Ł. (2013). Large Scale Citation Matching Using Apache Hadoop. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40501-3_37


DOI: https://doi.org/10.1007/978-3-642-40501-3_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40500-6
Online ISBN: 978-3-642-40501-3
eBook Packages: Computer Science, Computer Science (R0)
