Skip to main content

Text Indexing with Errors

  • Conference paper
Combinatorial Pattern Matching (CPM 2005)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 3537))

Included in the following conference series:

Abstract

In this paper we address the problem of constructing an index for a text document or a collection of documents to answer various questions about the occurrences of a pattern when allowing a constant number of errors. In particular, our index can be built to report all occurrences, all positions, or all documents where a pattern occurs in time linear in the size of the query string and the number of results. This improves over previous work where the lookup time is not linear or depends upon the size of the document corpus. Our data structure has size \(O\left(n\log^k n\right)\) on average and with high probability for input size n and queries with up to k errors. Additionally, we present a trade-off between query time and index complexity that achieves worst-case bounded index size and preprocessing time with linear lookup time on average.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Abouelhoda, M.I., Ohlebusch, E., Kurtz, S.: Optimal exact string matching based on suffix arrays. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, pp. 31–43. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  2. Amir, A., Keselman, D., Landau, G.M., Lewenstein, M., Lewenstein, N., Rodeh, M.: Indexing and dictionary matching with one error. J. Algorithms 37, 309–325 (2000)

    Article  MATH  MathSciNet  Google Scholar 

  3. Barkol, O., Rabani, Y.: Tighter bounds for nearest neighbor search and related problems in the cell probe model. In: Proc. 32nd ACM Symp. on Theory of Computing (STOC), pp. 388–396. ACM Press, New York (2000)

    Google Scholar 

  4. Borodin, A., Ostrovsky, R., Rabani, Y.: Lower bounds for high dimensional nearest neighbor search and related problems. In: Proc. 31st ACM Symp. on Theory of Computing (STOC), pp. 312–321. ACM Press, New York (1999)

    Google Scholar 

  5. Brodal, G.S., Ga̧sieniec, L.: Approximate dictionary queries. In: Hirschberg, D.S., Meyers, G. (eds.) CPM 1996. LNCS, vol. 1075, pp. 65–74. Springer, Heidelberg (1996)

    Google Scholar 

  6. Buchsbaum, A.L., Goodrich, M.T., Westbrook, J.: Range searching over tree cross products. In: Paterson, M. (ed.) ESA 2000. LNCS, vol. 1879, pp. 120–131. Springer, Heidelberg (2000)

    Chapter  Google Scholar 

  7. Chang, W.I., Lawler, E.L.: Sublinear approximate string matching and biological applications. Algorithmica 12, 327–344 (1994)

    Article  MATH  MathSciNet  Google Scholar 

  8. Chávez, E., Navarro, G.: A metric index for approximate string matching. In: Rajsbaum, S. (ed.) LATIN 2002. LNCS, vol. 2286, pp. 181–195. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  9. Cobbs, A.L.: Fast approximate matching using suffix trees. In: Galil, Z., Ukkonen, E. (eds.) CPM 1995. LNCS, vol. 937, pp. 41–54. Springer, Heidelberg (1995)

    Google Scholar 

  10. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proc. 36th ACM Symp. on Theory of Computing (STOC), pp. 91–100 (2004)

    Google Scholar 

  11. Demaine, E.D., López-Ortiz, A.: A linear lower bound on index size for text retrieval. In: Proc. 12th ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 289–294. ACM, New York (2001)

    Google Scholar 

  12. Gabow, H.N., Bentely, J.L., Tarjan, R.E.: Scaling and related techniques for geometry problems. In: Proc. 16th ACM Symp. on Theory of Computing (STOC), pp. 135–143. ACM, New York (1984)

    Google Scholar 

  13. Gabriele, A., Mignosi, F., Restivo, A., Sciortino, M.: Indexing structures for approximate string matching. In: Petreschi, R., Persiano, G., Silvestri, R. (eds.) CIAC 2003. LNCS, vol. 2653, pp. 140–151. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

  14. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Comp. Science and Computational Biology. Cambridge University Press, Cambridge (1997)

    Book  Google Scholar 

  15. Indyk, P.: Nearest neighbors in high-dimensional spaces. In: Goodman, J.E., O’Rourke, J. (eds.) Handbook of Discrete and Computational Geometry, ch. 39, 2nd edn. CRC Press LLC, Boca Raton (2004)

    Google Scholar 

  16. Maaß, M.G.: Average-case analysis of approximate trie search. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 472–483. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  17. Maaß, M.G., Nowak, J.: Text indexing with errors. Technical Report TUMI0503, Fakultät für Informatik, TU München (Mar 2005)

    Google Scholar 

  18. Maaß, M.G., Nowak, J.: A new method for approximate indexing and dictionary lookup with one error. Information Processing Letters (IPL) (to be published)

    Google Scholar 

  19. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)

    Article  MATH  MathSciNet  Google Scholar 

  20. Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proc. 13th ACM-SIAM Symp. on Discrete Algorithms (SODA), ACM/SIAM (2002)

    Google Scholar 

  21. Myers, E.W.: A sublinear algorithm for approximate keyword searching. Algorithmica 12, 345–374 (1994)

    Article  MATH  MathSciNet  Google Scholar 

  22. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)

    Article  Google Scholar 

  23. Navarro, G., Baeza-Yates, R.: A hybrid indexing method for approximate string matching. J. Discrete Algorithms 1(1), 205–209 (2000); Special issue on Matching Patterns

    MathSciNet  Google Scholar 

  24. Navarro, G., Baeza-Yates, R.A., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001)

    Google Scholar 

  25. Nowak, J.: A new indexing method for approximate pattern matching with one mismatch. Master’s thesis, Fakultät für Informatik, Technische Universität München, Boltzmannstr. 3, D-85748 Garching (February 2004)

    Google Scholar 

  26. Pittel, B.: Asymptotical growth of a class of random trees. Annals of Probability 13(2), 414–427 (1985)

    Article  MATH  MathSciNet  Google Scholar 

  27. Szpankowski, W.: Asymptotic properties of data compression and suffix trees. IEEE Transact. on Information Theory 39(5), 1647–1659 (1993)

    Article  MATH  MathSciNet  Google Scholar 

  28. Szpankowski, W.: Average Case Analysis of Algorithms on Sequences, 1st edn. Wiley-Interscience, Chichester (2000)

    Google Scholar 

  29. Ukkonen, E.: Algorithms for approximate string matching. Information and Control 64, 100–118 (1985)

    Article  MATH  MathSciNet  Google Scholar 

  30. Ukkonen, E.: Approximate string-matching over suffix trees. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds.) CPM 1993. LNCS, vol. 684, pp. 228–242. Springer, Heidelberg (1993)

    Chapter  Google Scholar 

  31. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14, 249–260 (1995)

    Article  MATH  MathSciNet  Google Scholar 

  32. Weiner, P.: Linear pattern matching. In: Proc. 14th IEEE Symp. on Switching and Automata Theory, pp. 1–11. IEEE, Los Alamitos (1973)

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Maaß, M.G., Nowak, J. (2005). Text Indexing with Errors. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_3

Download citation

  • DOI: https://doi.org/10.1007/11496656_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26201-5

  • Online ISBN: 978-3-540-31562-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics