Improving cis-regulatory elements modeling by consensus scaffolded mixture models

  • Research Paper
  • Science China Information Sciences

Abstract

A position weight matrix (PWM) is widely accepted as a probabilistic representation of protein-DNA binding specificity. Previous studies showed that for factors that bind to divergent binding sites, mixtures of multiple PWMs improve performance. We propose a consensus scaffolded mixture PWM (CSM) model to improve the modeling of cis-regulatory elements by allowing overlapping components represented by a set of PWMs, each of which corresponds to a binding pattern and is scaffolded by a degenerate consensus. In addition, we propose a learning algorithm that consists of an initial structure learning stage based on frequent pattern mining and a refining stage based on the expectation maximization (EM) algorithm. We assess the merits of CSM using three independent criteria. In a case study of the transcription factor Leu3, the derived CSM models agree with conventional mixtures but show a better fit according to the Fermi-Dirac distribution. Analysis of the human-mouse conservation of predicted binding sites of 83 JASPAR transcription factors (TFs) shows that the CSM is as good as or better than the simple mixture, the context-specific independence (CSI) mixture, and the single PWM model in 83%, 84%, and 75% of the cases, respectively. Five-fold cross-validation on 46 TRANSFAC datasets shows that the CSM model generalizes better than other mixture models.
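
To make the baseline concrete, here is a minimal sketch of a plain mixture of PWMs fitted by EM on aligned, equal-length binding sites. It is not the CSM model: it has no consensus scaffold and no frequent-pattern structure learning stage. The function names, the pseudocount of 0.5, the random soft initialization, and the fixed iteration count are all assumptions made for illustration, not details taken from the paper.

```python
import random

ALPHABET = "ACGT"


def fit_pwm(sites, weights, pseudocount=0.5):
    """Estimate one PWM (a list of per-position dicts) from weighted sites."""
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        # The pseudocount (an assumed value) keeps every probability nonzero.
        counts = {base: pseudocount for base in ALPHABET}
        for site, weight in zip(sites, weights):
            counts[site[pos]] += weight
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in ALPHABET})
    return pwm


def site_likelihood(pwm, site):
    """Probability of a site under one PWM, treating positions as independent."""
    prob = 1.0
    for column, base in zip(pwm, site):
        prob *= column[base]
    return prob


def mixture_likelihood(priors, pwms, site):
    """Probability of a site under the mixture: sum_k prior_k * P(site | PWM_k)."""
    return sum(p * site_likelihood(m, site) for p, m in zip(priors, pwms))


def em_mixture(sites, n_components=2, n_iter=50, seed=0):
    """Fit a mixture of PWMs with EM; returns (priors, pwms, responsibilities)."""
    rng = random.Random(seed)
    # Start from random soft assignments of sites to components.
    resp = []
    for _ in sites:
        row = [rng.random() + 1e-3 for _ in range(n_components)]
        norm = sum(row)
        resp.append([r / norm for r in row])
    for _ in range(n_iter):
        # M-step: re-estimate mixing weights and PWMs from responsibilities.
        priors = [sum(r[k] for r in resp) / len(sites)
                  for k in range(n_components)]
        pwms = [fit_pwm(sites, [r[k] for r in resp])
                for k in range(n_components)]
        # E-step: posterior probability of each component for each site.
        resp = []
        for site in sites:
            joint = [priors[k] * site_likelihood(pwms[k], site)
                     for k in range(n_components)]
            norm = sum(joint)
            resp.append([j / norm for j in joint])
    return priors, pwms, resp
```

Calling `em_mixture(sites, n_components=2)` on a list of aligned site strings returns the mixing weights, one PWM per component, and the per-site responsibilities. A candidate site can then be scored with `mixture_likelihood(priors, pwms, site)`.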

Author information

Corresponding author

Correspondence to Ying Zhao.

About this article

Cite this article

Jiang, H., Zhao, Y., Chen, W. et al. Improving cis-regulatory elements modeling by consensus scaffolded mixture models. Sci. China Inf. Sci. 56, 1–11 (2013). https://doi.org/10.1007/s11432-011-4374-9

  • DOI: https://doi.org/10.1007/s11432-011-4374-9
