Fixing Weakly Annotated Web Data Using Relational Models

Gelgi, Fatih; Vadrevu, Srinivas; Davulcu, Hasan

doi:10.1007/978-3-540-73597-7_32

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4607))

Included in the following conference series:

International Conference on Web Engineering

Abstract

In this paper, we present a fast and scalable Bayesian model for improving weakly annotated data, which is typically generated from Web documents by a (semi-)automated information extraction (IE) system. Weakly annotated data suffers from two major problems: it (i) might contain incorrect ontological role assignments, and (ii) might have many missing attributes. Our experimental evaluations with the TAP and RoadRunner data sets, and a collection of 20,000 home pages from university, shopping and sports Web sites, indicate that the model described here can improve the accuracy of role assignments from 40% to 85% for template-driven sites and from 68% to 87% for non-template-driven sites. The Bayesian model is also shown to be useful for improving the performance of IE systems by informing them with additional domain information.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Arizona State University, Tempe, AZ, 85287, USA
Fatih Gelgi, Srinivas Vadrevu & Hasan Davulcu

Editor information

Luciano Baresi, Piero Fraternali, Geert-Jan Houben

About this paper

Cite this paper

Gelgi, F., Vadrevu, S., Davulcu, H. (2007). Fixing Weakly Annotated Web Data Using Relational Models. In: Baresi, L., Fraternali, P., Houben, GJ. (eds) Web Engineering. ICWE 2007. Lecture Notes in Computer Science, vol 4607. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73597-7_32

DOI: https://doi.org/10.1007/978-3-540-73597-7_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73596-0
Online ISBN: 978-3-540-73597-7
eBook Packages: Computer Science, Computer Science (R0)
