Abstract
In document search, documents are typically treated as a flat list of keywords. To deal with the syntactic interoperability problem, i.e., the use of different keywords to refer to the same real-world entity, entity linkage has been used to replace keywords in the text with a unique identifier of the entity to which they refer. Yet a flat list of entities fails to capture the actual relationships that exist among the entities, information that is significant for more effective document search. In this work we propose to go one step beyond entity linkage in text and model documents as a set of structures that describe the relationships among the entities mentioned in the text. We show that this kind of representation significantly improves the effectiveness of document search. We describe the implementation of this idea in detail and present an extensive set of experimental results that support our claim.
Notes
1. The meaning of a “noun phrase” is the one used in linguistics.
2. Note that a “raw document” is what we defined as the document that the user provided, while a “document” is a set of statements containing identifiers and verbs.
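The distinction drawn in Note 2 and in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Statement` type, the `ENTITY_IDS` linkage table, the identifiers `E1` to `E3`, and the pre-extracted keyword triples are all hypothetical, and exact set containment stands in for whatever matching and ranking the chapter actually uses. It only shows why a set of statements (identifiers plus verbs) can separate documents that a flat list of entity identifiers cannot.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str  # entity identifier (hypothetical)
    verb: str
    obj: str      # entity identifier (hypothetical)


# Hypothetical entity-linkage table: surface keyword -> unique entity identifier.
ENTITY_IDS = {
    "Obama": "E1",
    "Barack Obama": "E1",
    "Merkel": "E2",
    "Angela Merkel": "E2",
    "Paris": "E3",
}


def link(triples):
    """Turn (subject, verb, object) keyword triples into identifier statements."""
    return {Statement(ENTITY_IDS[s], v, ENTITY_IDS[o]) for s, v, o in triples}


def entities(doc):
    """Flat view: the set of entity identifiers mentioned in a document."""
    return {e for st in doc for e in (st.subject, st.obj)}


# Triples are assumed to be already extracted from the raw documents.
doc_a = link([("Barack Obama", "visit", "Paris"), ("Merkel", "call", "Obama")])
doc_b = link([("Angela Merkel", "visit", "Paris"), ("Obama", "call", "Merkel")])
corpus = {"doc_a": doc_a, "doc_b": doc_b}

query = link([("Obama", "visit", "Paris")])

# Flat matching: every query entity appears somewhere in the document.
flat_hits = sorted(n for n, d in corpus.items() if entities(query) <= entities(d))
# Structured matching: the query statement itself appears in the document.
structured_hits = sorted(n for n, d in corpus.items() if query <= d)

print(flat_hits)        # ['doc_a', 'doc_b']
print(structured_hits)  # ['doc_a']
```

Both documents mention the same three entities, so the flat view returns both, while the statement view returns only the document in which the queried relationship holds.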
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Sartori, E., Velegrakis, Y., Guerra, F. (2016). Entity-Based Keyword Search in Web Documents. In: Nguyen, N.T., Kowalczyk, R., Rupino da Cunha, P. (eds) Transactions on Computational Collective Intelligence XXI. Lecture Notes in Computer Science(), vol 9630. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49521-6_2
DOI: https://doi.org/10.1007/978-3-662-49521-6_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49520-9
Online ISBN: 978-3-662-49521-6
eBook Packages: Computer Science, Computer Science (R0)