Large-Scale Sequence Comparison

Lal, Devi; Verma, Mansi

doi:10.1007/978-1-4939-6622-6_9


Part of the book series: Methods in Molecular Biology (MIMB, volume 1525)


Abstract

Genomic databases hold millions of sequences, and categorizing them according to their structural and functional roles is an important task. Sequence comparison is a prerequisite for categorizing both DNA and protein sequences correctly, and it helps in assigning a putative or hypothetical structure and function to a given sequence. Of the various methods available for comparing sequences, alignment is the foremost, both for short sequences and for large-scale genome comparison. Various tools are available for pairwise comparison of large sequences; the best-known tools either perform a global alignment or generate local alignments between the two sequences. In this chapter we first provide basic information on sequence comparison, followed by a description of the PAM and BLOSUM matrices that form the basis of sequence comparison. We also give a practical overview of currently available methods such as BLAST and FASTA, followed by a description and overview of tools available for genome comparison, including LAGAN, MUMmer, BLASTZ, and AVID.
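The difference between global and local alignment mentioned above can be illustrated with a minimal dynamic-programming sketch. It is not taken from the chapter: the function name `align_score` and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration, and a real protein comparison would take substitution scores from a PAM or BLOSUM matrix instead of a flat match/mismatch score.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the best pairwise alignment score via dynamic programming.

    local=False: global score over the full length of both sequences
                 (Needleman-Wunsch style).
    local=True:  best score over any pair of subsequences
                 (Smith-Waterman style).
    Scoring values are illustrative assumptions, not PAM/BLOSUM scores.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if local:
                # Local alignment: a negative prefix is dropped.
                cell = max(cell, 0)
                best = max(best, cell)
            H[i][j] = cell

    # Global: score in the bottom-right cell. Local: best cell anywhere.
    return best if local else H[-1][-1]


global_score = align_score("ACGT", "AGT")
local_score = align_score("ACGT", "AGT", local=True)
```

With these assumed scores, the global alignment must pay for the gap opposite `C`, whereas the local alignment simply reports the best-matching shared stretch (`GT`). This sketch returns scores only; recovering the alignment itself would require a traceback step.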



Author information

Authors and Affiliations

Ramjas College, University of Delhi, New Delhi, 110 007, India
Devi Lal
Sri Venkateswara College, University of Delhi (South Campus), Benito Juarez Road, Dhaula Kuan, New Delhi, 110 021, India
Mansi Verma


Corresponding author

Correspondence to Mansi Verma.

Editor information

Editors and Affiliations

Monash University, Melbourne, Victoria, Australia
Jonathan M. Keith


About this protocol

Cite this protocol

Lal, D., Verma, M. (2017). Large-Scale Sequence Comparison. In: Keith, J. (ed) Bioinformatics. Methods in Molecular Biology, vol 1525. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6622-6_9


DOI: https://doi.org/10.1007/978-1-4939-6622-6_9
Published: 29 November 2016
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-6620-2
Online ISBN: 978-1-4939-6622-6
eBook Packages: Springer Protocols
