Abstract
Domain-specific Internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries over summer camps by age, location, cost and specialty, which is not possible with general, Web-wide search engines. Unfortunately, these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are widely applicable to portal creation in other domains.
Cite this article
McCallum, A.K., Nigam, K., Rennie, J. et al. Automating the Construction of Internet Portals with Machine Learning. Information Retrieval 3, 127–163 (2000). https://doi.org/10.1023/A:1009953814988
DOI: https://doi.org/10.1023/A:1009953814988