An Approach of Information Extraction from Web Documents for Automatic Ontology Generation

Yeom, Ki-Won; Park, Ji-Hyung

doi:10.1007/11596448_66

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3801)

Included in the following conference series:

International Conference on Computational and Information Science

Abstract

We examine an automated mechanism that lets users access information in web documents in a structured manner by segmenting unformatted text records into structured elements, annotating the documents with XML tags, and applying specific query processing techniques. This research is a first step toward an automatic ontology generation system. We therefore focus on explaining how structure can be extracted automatically when the system is seeded with a small number of training examples. We propose an approach based on Hidden Markov Models (HMMs) that builds a probabilistic model combining multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. We introduce two different HMMs for information extraction from different sources, using bibliography and Call for Papers documents as training datasets. The proposed HMMs learn to distinguish the fields and then extract the title, authors, conference or journal names, and similar fields from the text.
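
The segmentation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the four states, the coarse token features, the hand-set probabilities, and the sample record are all assumptions chosen for illustration, and a real system would estimate the parameters from labeled training examples. The sketch decodes the most likely field sequence for a bibliography record with the Viterbi algorithm and then wraps each field in an XML tag.

```python
import math
from itertools import groupby
from xml.sax.saxutils import escape

# Hypothetical states and hand-set toy parameters (not values from the paper).
STATES = ["author", "title", "venue", "year"]
START = {"author": 0.85, "title": 0.05, "venue": 0.05, "year": 0.05}
TRANS = {
    "author": {"author": 0.55, "title": 0.43, "venue": 0.01, "year": 0.01},
    "title":  {"author": 0.01, "title": 0.80, "venue": 0.18, "year": 0.01},
    "venue":  {"author": 0.01, "title": 0.01, "venue": 0.70, "year": 0.28},
    "year":   {"author": 0.01, "title": 0.01, "venue": 0.01, "year": 0.97},
}
# Emission probabilities over coarse token features instead of a full vocabulary.
EMIT = {
    "author": {"initial": 0.50, "capword": 0.40, "word": 0.05, "venueword": 0.03, "number": 0.02},
    "title":  {"initial": 0.02, "capword": 0.35, "word": 0.55, "venueword": 0.05, "number": 0.03},
    "venue":  {"initial": 0.02, "capword": 0.30, "word": 0.15, "venueword": 0.50, "number": 0.03},
    "year":   {"initial": 0.01, "capword": 0.01, "word": 0.01, "venueword": 0.01, "number": 0.96},
}
# A tiny stand-in for the "distinguishing words" mentioned in the abstract.
VENUE_WORDS = {"proceedings", "conference", "journal", "workshop", "transactions"}


def feature(token):
    """Map a raw token to one of the coarse observation symbols."""
    core = token.strip(".,:;()")
    if core.isdigit():
        return "number"
    if core.lower() in VENUE_WORDS:
        return "venueword"
    if len(core) == 1 and core.isupper():
        return "initial"
    if core[:1].isupper():
        return "capword"
    return "word"


def viterbi(tokens):
    """Return the most likely state label for each token (log-space Viterbi)."""
    obs = [feature(t) for t in tokens]
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            pointers[s] = best
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def to_xml(tokens, labels):
    """Wrap each run of identically labeled tokens in an XML element."""
    parts = []
    for label, group in groupby(zip(labels, tokens), key=lambda pair: pair[0]):
        text = " ".join(tok for _, tok in group)
        parts.append(f"<{label}>{escape(text)}</{label}>")
    return "".join(parts)


if __name__ == "__main__":
    # Invented sample record, used only to exercise the sketch.
    record = "Kim, J., Lee, S.: Extracting fields from web text. Workshop on Text Mining (2004)"
    tokens = record.split()
    print(to_xml(tokens, viterbi(tokens)))
```

Working in log space avoids numeric underflow on long records. The length distribution and external dictionary described in the abstract are not modeled here.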

Author information

Authors and Affiliations

CAD/CAM Research Center, Korea Institute of Science and Technology, 39-1, Hawolgok-dong, Seongbuk-gu, Seoul, Korea
Ki-Won Yeom & Ji-Hyung Park

Editor information

Editors and Affiliations

Microelectronic Institute, Xidian University, 710071, Xi’an, China
Yue Hao
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
Jiming Liu
School of Computer Science and Technology, Xidian University, Xi’an, China
Yuping Wang
Department of Computer Science, Hong Kong Baptist University, Hong Kong
Yiu-ming Cheung
School of Electrical and Electronic Engineering, University of Manchester, UK
Hujun Yin
Life Science Research Center, School of Electronic Engineering, Xidian University, 710071, Xi’an, Shaanxi, China
Licheng Jiao
Key Laboratory of Computer Networks and Information Security (Ministry of Education), Xidian University, 710071, Xi’an, China
Jianfeng Ma
National Laboratory of Antennas and Microwave Technology, Xidian University, 710071, Xi’an, Shaanxi, P.R. China
Yong-Chang Jiao

About this paper

Cite this paper

Yeom, KW., Park, JH. (2005). An Approach of Information Extraction from Web Documents for Automatic Ontology Generation. In: Hao, Y., et al. Computational Intelligence and Security. CIS 2005. Lecture Notes in Computer Science, vol 3801. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11596448_66

DOI: https://doi.org/10.1007/11596448_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30818-8
Online ISBN: 978-3-540-31599-5
eBook Packages: Computer Science, Computer Science (R0)
