Abstract
This paper is a manifesto aimed at computer scientists interested in developing and applying scientific discovery methods. It argues that: science is experiencing an unprecedented “explosion” in the amount of available data; traditional data analysis methods cannot deal with this increased quantity of data; there is an urgent need to automate the process of refining scientific data into scientific knowledge; inductive logic programming (ILP) is a data analysis framework well suited for this task; and exciting new scientific discoveries can be achieved using ILP scientific discovery methods. We describe an example in which ILP was used to analyse a large and complex bioinformatic database, producing unexpected and interesting scientific results in functional genomics. We then point to a possible way forward: integrating machine learning with scientific databases to form intelligent databases.
© 2007 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
King, R.D., Karwath, A., Clare, A., Dehaspe, L. (2007). Logic and the Automatic Acquisition of Scientific Knowledge: An Application to Functional Genomics. In: Džeroski, S., Todorovski, L. (eds) Computational Discovery of Scientific Knowledge. Lecture Notes in Computer Science, vol 4660. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73920-3_13
DOI: https://doi.org/10.1007/978-3-540-73920-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73919-7
Online ISBN: 978-3-540-73920-3
eBook Packages: Computer Science (R0)