Abstract
In this paper, we present a fast and scalable Bayesian model for improving weakly annotated data, which is typically generated by a (semi-)automated information extraction (IE) system from Web documents. Weakly annotated data suffers from two major problems: it (i) might contain incorrect ontological role assignments, and (ii) might have many missing attributes. Our experimental evaluations with the TAP and RoadRunner data sets, and a collection of 20,000 home pages from university, shopping, and sports Web sites, indicate that the model described here can improve the accuracy of role assignments from 40% to 85% for template-driven sites and from 68% to 87% for non-template-driven sites. The Bayesian model is also shown to be useful for improving the performance of IE systems by informing them with additional domain information.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
Cite this paper
Gelgi, F., Vadrevu, S., Davulcu, H. (2007). Fixing Weakly Annotated Web Data Using Relational Models. In: Baresi, L., Fraternali, P., Houben, GJ. (eds) Web Engineering. ICWE 2007. Lecture Notes in Computer Science, vol 4607. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73597-7_32
DOI: https://doi.org/10.1007/978-3-540-73597-7_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73596-0
Online ISBN: 978-3-540-73597-7
eBook Packages: Computer Science (R0)