Abstract
Many entries in major biological databases have incomplete functional annotation, so it is often difficult to identify the entries that belong to a specific functional category. We combined information on protein functional domains with Gene Ontology descriptions to identify transcription factor (TF) entries in the Swiss-Prot and Entrez Gene databases with high accuracy. Our method uses support vector machines and efficiently separates TF entries from non-TF entries. In 10-fold cross-validation, the predictions achieved on average a positive predictive value of 97.5% and a sensitivity of 93.4%. Using this method, we scanned the whole Swiss-Prot and Entrez Gene databases and extracted 13,826 unique TF entries. In a separate manual check of 500 randomly chosen extracted TF entries, 2% proved to be non-TF (erroneous) entries.
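For readers less familiar with the two reported metrics, the following minimal sketch shows how positive predictive value and sensitivity are conventionally computed and averaged over cross-validation folds. It is an illustration only, not the authors' code: the function names and the per-fold counts are hypothetical, and the only figures taken from the abstract are the sample size of 500 and the 2% error rate.

```python
from typing import List, Tuple


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: share of predicted TF entries that are true TFs."""
    if tp + fp == 0:
        raise ValueError("no positive predictions in this fold")
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity (recall): share of true TF entries that were predicted as TFs."""
    if tp + fn == 0:
        raise ValueError("no true TF entries in this fold")
    return tp / (tp + fn)


def mean_over_folds(folds: List[Tuple[int, int, int]]) -> Tuple[float, float]:
    """Average PPV and sensitivity over folds given as (tp, fp, fn) counts."""
    ppvs = [ppv(tp, fp) for tp, fp, _ in folds]
    sens = [sensitivity(tp, fn) for tp, _, fn in folds]
    return sum(ppvs) / len(ppvs), sum(sens) / len(sens)


# Hypothetical per-fold counts (tp, fp, fn); NOT data from the paper.
example_folds = [(90, 10, 10), (80, 20, 20)]
mean_ppv, mean_sens = mean_over_folds(example_folds)

# Manual check described in the abstract: 2% of the 500 sampled entries.
erroneous_in_sample = round(0.02 * 500)

if __name__ == "__main__":
    print(f"mean PPV = {mean_ppv:.3f}, mean sensitivity = {mean_sens:.3f}")
    print(f"erroneous entries in the manual sample: {erroneous_in_sample}")
```

Averaging per-fold values is one common way to summarize cross-validation; the abstract does not state whether the reported averages were computed this way or from pooled counts.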
Keywords
- Gene Ontology
- Transcription Factor Activity
- Pfam Domain
- Gene Ontology Annotation
- Human Protein Reference Database
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, Z., Veronika, M., Ng, SK., Bajic, V.B. (2006). Intelligent Extraction Versus Advanced Query: Recognize Transcription Factors from Databases. In: Rajapakse, J.C., Wong, L., Acharya, R. (eds) Pattern Recognition in Bioinformatics. PRIB 2006. Lecture Notes in Computer Science, vol 4146. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11818564_15
DOI: https://doi.org/10.1007/11818564_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37446-6
Online ISBN: 978-3-540-37447-3
eBook Packages: Computer Science, Computer Science (R0)