Large Scale Citation Matching Using Apache Hadoop

  • Conference paper
Research and Advanced Technology for Digital Libraries (TPDL 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8092)

Abstract

During the process of citation matching, links from bibliography entries to referenced publications are created. Such links indicate topical similarity between the linked texts, are used to assess the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. In this paper we present a citation matching method and show how to scale it up to handle large amounts of data using appropriate indexing and the MapReduce paradigm in the Hadoop environment.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fedoryszak, M., Tkaczyk, D., Bolikowski, Ł. (2013). Large Scale Citation Matching Using Apache Hadoop. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40501-3_37

  • DOI: https://doi.org/10.1007/978-3-642-40501-3_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40500-6

  • Online ISBN: 978-3-642-40501-3

  • eBook Packages: Computer Science, Computer Science (R0)
