Non-adjacent Digrams Improve Matching of Cross-Lingual Spelling Variants

  • Conference paper
String Processing and Information Retrieval (SPIRE 2003)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2857))

Abstract

Untranslatable query keys pose a problem in dictionary-based cross-language information retrieval (CLIR). One solution consists of using approximate string matching methods for finding the spelling variants of the source key among the target database index. In such a setting, it is important to select a matching method suited especially for CLIR. This paper focuses on comparing the effectiveness of several matching methods in a cross-lingual setting. Search words from five domains were expressed in six languages (French, Spanish, Italian, German, Swedish, and Finnish). The target data consisted of the index of an English full-text database. In this setting, we first established the best method among six baseline matching methods for each language pair. Secondly, we tested novel matching methods based on binary digrams formed of both adjacent and non-adjacent characters of words. The latter methods consistently outperformed all baseline methods.
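The idea of matching on digrams built from both adjacent and non-adjacent characters can be illustrated with a minimal sketch. The abstract does not specify the skip distances, the similarity formula, or any implementation, so everything below is an assumption made for illustration: the function names (`skip_digrams`, `similarity`, `best_matches`) are hypothetical, the skip distances default to 0 and 1, and the score is a simple set-overlap ratio (shared digrams divided by all distinct digrams). Digrams are treated as binary, meaning each one is either present or absent in a word, with no counts.

```python
def skip_digrams(word, skips=(0, 1)):
    """Return the set of two-character pairs of `word`.

    For each k in `skips`, pair every character with the one that
    follows it after k skipped characters (k = 0 gives ordinary
    adjacent digrams). A set is used, so each digram is binary:
    present or absent.
    """
    grams = set()
    for k in skips:
        for i in range(len(word) - k - 1):
            grams.add(word[i] + word[i + k + 1])
    return grams


def similarity(a, b, skips=(0, 1)):
    """Set-overlap ratio of the two digram sets (an assumed measure)."""
    ga, gb = skip_digrams(a, skips), skip_digrams(b, skips)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def best_matches(key, index_words, skips=(0, 1), top=3):
    """Rank target index words by similarity to the source key."""
    ranked = sorted(index_words,
                    key=lambda w: similarity(key, w, skips),
                    reverse=True)
    return ranked[:top]


# Hypothetical example: a French source key against a tiny English index.
index = ["pneumatic", "ammonia", "pneumonia"]
print(best_matches("pneumonie", index))
```

Passing `skips=(0,)` reduces the sketch to conventional adjacent-digram matching, which gives a rough way to compare the two variants on the same word list.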

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Keskustalo, H., Pirkola, A., Visala, K., Leppänen, E., Järvelin, K. (2003). Non-adjacent Digrams Improve Matching of Cross-Lingual Spelling Variants. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_19

  • DOI: https://doi.org/10.1007/978-3-540-39984-1_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20177-9

  • Online ISBN: 978-3-540-39984-1

  • eBook Packages: Springer Book Archive
