XML Document Transformation with Conditional Random Fields

  • Conference paper
Comparative Evaluation of XML Information Retrieval Systems (INEX 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4518)

Abstract

We address the problem of structure mapping that arises in XML data exchange or XML document transformation. Our approach relies on annotating XML documents with semantic labels that describe local tree edit operations. We propose XML Conditional Random Fields (XCRFs), a framework for building conditional models for labeling XML documents. We equip XCRFs with efficient algorithms for inference and parameter estimation. We provide theoretical arguments and practical experiments that illustrate their expressivity and efficiency. Experiments on the Structure Mapping movie datasets of the INEX XML Document Mining Challenge yield very good results.


Editor information

Norbert Fuhr, Mounia Lalmas, Andrew Trotman

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gilleron, R., Jousse, F., Tellier, I., Tommasi, M. (2007). XML Document Transformation with Conditional Random Fields. In: Fuhr, N., Lalmas, M., Trotman, A. (eds) Comparative Evaluation of XML Information Retrieval Systems. INEX 2006. Lecture Notes in Computer Science, vol 4518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73888-6_48

  • DOI: https://doi.org/10.1007/978-3-540-73888-6_48

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73887-9

  • Online ISBN: 978-3-540-73888-6

  • eBook Packages: Computer Science, Computer Science (R0)
