Non-adjacent Digrams Improve Matching of Cross-Lingual Spelling Variants

Keskustalo, Heikki; Pirkola, Ari; Visala, Kari; Leppänen, Erkka; Järvelin, Kalervo

doi:10.1007/978-3-540-39984-1_19

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2857)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Untranslatable query keys pose a problem in dictionary-based cross-language information retrieval (CLIR). One solution consists of using approximate string matching methods for finding the spelling variants of the source key among the target database index. In such a setting, it is important to select a matching method suited especially for CLIR. This paper focuses on comparing the effectiveness of several matching methods in a cross-lingual setting. Search words from five domains were expressed in six languages (French, Spanish, Italian, German, Swedish, and Finnish). The target data consisted of the index of an English full-text database. In this setting, we first established the best method among six baseline matching methods for each language pair. Secondly, we tested novel matching methods based on binary digrams formed of both adjacent and non-adjacent characters of words. The latter methods consistently outperformed all baseline methods.
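
The abstract does not spell out how the digrams are built or compared, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a "binary" digram profile is a set (each digram is either present or absent), that non-adjacent digrams are formed by skipping a fixed number of characters, that digrams from different skip distances are kept apart, and that two profiles are compared with a Jaccard-style overlap. The function names `skip_digrams`, `similarity`, and `rank_variants`, the `skips` parameter, and the sample words are all hypothetical.

```python
def skip_digrams(word, skips=(0, 1)):
    """Return the set of (skip, digram) pairs for a word.

    skip = 0 gives ordinary adjacent digrams; skip = k pairs each
    character with the one k positions further along (non-adjacent).
    Using a set makes the profile binary: a digram counts once.
    """
    word = word.lower()
    grams = set()
    for k in skips:
        for i in range(len(word) - k - 1):
            grams.add((k, word[i] + word[i + k + 1]))
    return grams


def similarity(a, b, skips=(0, 1)):
    """Jaccard-style overlap of two digram sets (an assumed measure)."""
    ga, gb = skip_digrams(a, skips), skip_digrams(b, skips)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def rank_variants(source_key, index_words, skips=(0, 1), top=3):
    """Rank target index words by similarity to the source key."""
    scored = [(similarity(source_key, w, skips), w) for w in index_words]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top]


if __name__ == "__main__":
    # Hypothetical source key matched against a toy target index.
    index = ["pneumonia", "pneumatic", "harmonica"]
    for score, word in rank_variants("pneumonie", index):
        print(f"{word}: {score:.2f}")
```

Passing `skips=(0,)` reduces the sketch to conventional adjacent-digram matching, which makes it possible to compare the two settings on the same word list.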

Author information

Authors and Affiliations

Department of Information Studies, University of Tampere, Finland
Heikki Keskustalo, Ari Pirkola, Kari Visala, Erkka Leppänen & Kalervo Järvelin

Editor information

Editors and Affiliations

Department of Computing Science, University of Alberta, Canada
Mario A. Nascimento
Universidade Federal do Amazonas, Manaus, AM, Brasil
Edleno S. de Moura
INESC-ID/IST, R. Alves Redol 9, 1000, Lisboa, Portugal
Arlindo L. Oliveira

About this paper

Cite this paper

Keskustalo, H., Pirkola, A., Visala, K., Leppänen, E., Järvelin, K. (2003). Non-adjacent Digrams Improve Matching of Cross-Lingual Spelling Variants. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_19

DOI: https://doi.org/10.1007/978-3-540-39984-1_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20177-9
Online ISBN: 978-3-540-39984-1
eBook Packages: Springer Book Archive
