Large-Scale Sequence Comparison

Bioinformatics

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1525))

Abstract

Genomic databases hold millions of sequences, and categorizing them according to their structural and functional roles is an important task. Sequence comparison is a prerequisite for proper categorization of both DNA and protein sequences, and it helps in assigning a putative or hypothetical structure and function to a given sequence. Of the various methods available for comparing sequences, alignment is the foremost, both for sequences with a small number of base pairs and for large-scale genome comparison. Various tools are available for pairwise comparison of large sequences; the best known either perform a global alignment or generate local alignments between the two sequences. In this chapter we first provide basic information on sequence comparison, followed by a description of the PAM and BLOSUM matrices that form its basis. We also give a practical overview of currently available methods such as BLAST and FASTA, and then describe tools available for genome comparison, including LAGAN, MUMmer, BLASTZ, and AVID.
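The distinction between global and local alignment mentioned above can be illustrated with a minimal Python sketch. It is not taken from the chapter: the function name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration, where a protein comparison would normally draw substitution scores from a matrix such as PAM or BLOSUM. The global case follows the Needleman-Wunsch recurrence and the local case the Smith-Waterman recurrence.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the best alignment score of sequences a and b.

    local=False scores a global alignment (Needleman-Wunsch recurrence);
    local=True scores the best local alignment (Smith-Waterman recurrence).
    The scoring values are illustrative assumptions, not PAM/BLOSUM scores.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # A global alignment must pay for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of aligning both sequences end to end.
    # Local: highest-scoring cell anywhere in the matrix.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "AGT"))              # global score
    print(alignment_score("ACGT", "AGT", local=True))  # local score
```

Both variants fill a full dynamic programming matrix, so time and memory grow with the product of the two sequence lengths. That cost is the reason genome-scale tools rely on heuristics rather than exhaustive alignment.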

Author information

Corresponding author

Correspondence to Mansi Verma.

Copyright information

© 2017 Springer Science+Business Media New York

About this protocol

Cite this protocol

Lal, D., Verma, M. (2017). Large-Scale Sequence Comparison. In: Keith, J. (ed.) Bioinformatics. Methods in Molecular Biology, vol 1525. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6622-6_9

  • DOI: https://doi.org/10.1007/978-1-4939-6622-6_9

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-6620-2

  • Online ISBN: 978-1-4939-6622-6

  • eBook Packages: Springer Protocols
