Data Extraction Using NLP Techniques and Its Transformation to Linked Data

Kríž, Vincent; Hladká, Barbora; Nečaský, Martin; Knap, Tomáš

doi:10.1007/978-3-319-13647-9_13

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8856)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

We present a system that extracts a knowledge base from raw, unstructured text. The knowledge base is designed as a set of entities and the relations between them and is represented in an ontological framework. The extraction pipeline processes input texts with linguistically aware tools and extracts entities and relations from the texts' syntactic representation. The extracted data is then represented according to the Linked Data principles. The system is designed to be both domain-independent and language-independent, and it provides users with data that supports more intelligent search than full-text search. We present our first case study, on processing Czech legal texts.
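
The last step of the pipeline described above, representing extracted entities and relations as Linked Data, can be illustrated with a minimal sketch. It is not the authors' implementation: the namespace http://example.org/, the helper names iri and to_ntriples, and the sample relation are all assumptions made for illustration. The sketch mints an IRI for each label and writes each (subject, relation, object) tuple as one line of N-Triples, a standard RDF serialization.

```python
from urllib.parse import quote

# Hypothetical namespace; the paper's actual vocabulary is not given here.
BASE = "http://example.org/"


def iri(kind: str, label: str) -> str:
    """Mint an IRI for a label: spaces become underscores, the rest is percent-encoded."""
    local_name = quote(label.strip().replace(" ", "_"), safe="")
    return f"<{BASE}{kind}/{local_name}>"


def to_ntriples(relations) -> str:
    """Serialize (subject, relation, object) label tuples as N-Triples lines."""
    lines = []
    for subj, rel, obj in relations:
        lines.append(f"{iri('entity', subj)} {iri('relation', rel)} {iri('entity', obj)} .")
    return "\n".join(lines)


# Hypothetical extractor output. In the described system such tuples would be
# derived from the syntactic representation of the input text.
extracted = [("lessor", "has obligation", "tenant")]
print(to_ntriples(extracted))
```

Once entities have stable IRIs, a search application can follow relations between them instead of matching strings, which is the kind of search beyond full-text search that the abstract refers to.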

Author information

Authors and Affiliations

Institute of Formal and Applied Linguistics, Charles University in Prague, Malostranské nám. 25, 118 00, Praha 1, Czech Republic
Vincent Kríž & Barbora Hladká
Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague, Malostranské nám. 25, 118 00, Praha 1, Czech Republic
Martin Nečaský & Tomáš Knap

Editor information

Editors and Affiliations

Centro de Investigación en Computación, Instituto Politécnico Nacional, Av. Juan Dios Bátiz s/n, Col. Nueva Industrial Vallejo, 07738, Mexico City, Mexico
Alexander Gelbukh
Área Académica de Computación y Electrónica, Carretera Pachuca-Tulancingo, Universidad Autónoma del Estado de Hidalgo, Km. 4.5, Col. Carboneras, Mineral de la Reforma, 42180, Hidalgo, Mexico
Félix Castro Espinoza
Facultad de Ciencias, Universidad Nacional Autónoma de México, Ciudad Universitaria, México DF, Mexico
Sofía N. Galicia-Haro

About this paper

Cite this paper

Kríž, V., Hladká, B., Nečaský, M., Knap, T. (2014). Data Extraction Using NLP Techniques and Its Transformation to Linked Data. In: Gelbukh, A., Espinoza, F.C., Galicia-Haro, S.N. (eds) Human-Inspired Computing and Its Applications. MICAI 2014. Lecture Notes in Computer Science, vol 8856. Springer, Cham. https://doi.org/10.1007/978-3-319-13647-9_13

DOI: https://doi.org/10.1007/978-3-319-13647-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13646-2
Online ISBN: 978-3-319-13647-9
eBook Packages: Computer Science, Computer Science (R0)