Abstract
Online product search engines, such as Google and Yahoo Shopping, rely on extensive and complete product information to return accurate and timely search results. As the range of products grows and existing products are updated, automated techniques are needed to keep the underlying product dictionaries current and complete. Product search engines receive offers from merchants describing product-specific attributes and characteristics. These offers normally contain structured attribute-value pairs and unstructured (textual) descriptions of product characteristics and features. For example, a laptop offer may contain attribute-value pairs such as “model-X42” and “RAM-8 GB”, and a text description of the software, accessories, battery features, warranty, etc. Updating the product dictionaries from the textual descriptions is more challenging than from the attribute-value pairs, since the relevant attribute values must first be extracted. The task is difficult because the text descriptions often do not follow a predefined format, and their content varies across merchants and products. However, this information must be captured to ensure a comprehensive and complete product listing. In this paper, we present techniques that extract attribute values from textual product descriptions. We introduce an end-to-end framework that takes a string record as input and parses its tokens to identify candidate attribute values, which we then map to attributes. We take an information-theoretic approach to identify groups of tokens that represent an attribute value. We demonstrate the accuracy and relevance of our approach using a variety of real data sets.
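To make the idea of grouping tokens into candidate attribute values concrete, the following is a minimal sketch and not the chapter's actual method. The abstract does not state which information-theoretic measure is used, so this example assumes a simple stand-in: two adjacent tokens are merged when each is highly predictable from the other, measured as surprisal in bits. The function name segment_records, the parameter max_bits, and the sample records are all hypothetical.

```python
import math
from collections import Counter


def segment_records(records, max_bits=0.5):
    """Group adjacent tokens into candidate attribute values.

    Assumption (not from the chapter): adjacent tokens a, b are merged when
    both -log2 P(b | a on the left) and -log2 P(a | b on the right) are at
    most max_bits, with probabilities estimated from the records themselves.
    """
    tokenized = [r.lower().split() for r in records]

    pair_counts = Counter()   # how often (a, b) appear next to each other
    left_counts = Counter()   # how often a appears as the left token of a pair
    right_counts = Counter()  # how often b appears as the right token of a pair
    for toks in tokenized:
        for a, b in zip(toks, toks[1:]):
            pair_counts[(a, b)] += 1
            left_counts[a] += 1
            right_counts[b] += 1

    def surprisal(a, b):
        # Worst-case surprisal (in bits) of seeing the pair from either side.
        n = pair_counts[(a, b)]
        forward = -math.log2(n / left_counts[a])
        backward = -math.log2(n / right_counts[b])
        return max(forward, backward)

    segmented = []
    for toks in tokenized:
        segments = [[toks[0]]] if toks else []
        for a, b in zip(toks, toks[1:]):
            if surprisal(a, b) <= max_bits:
                segments[-1].append(b)   # predictable pair: same segment
            else:
                segments.append([b])     # surprising pair: start a new segment
        segmented.append([" ".join(s) for s in segments])
    return segmented


# Hypothetical offer descriptions, invented for illustration only.
records = [
    "dell laptop 8 gb",
    "hp laptop 8 gb",
    "dell tablet 8 gb",
    "hp tablet 8 gb",
]
print(segment_records(records)[0])  # ['dell', 'laptop', '8 gb']
```

In this toy input, "8" and "gb" always occur together and so form one candidate value, while the brand and device tokens vary independently and stay separate. Mapping each candidate value to an attribute, the second step the abstract describes, is not covered by this sketch.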
R.J. Miller—Supported by NSERC BIN.
Notes
1. Since a record contains tokens that may match a segment, we can think of the records as containing segments.
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Chiang, F., Andritsos, P., Miller, R.J. (2016). Data Driven Discovery of Attribute Dictionaries. In: Nguyen, N.T., Kowalczyk, R., Rupino da Cunha, P. (eds) Transactions on Computational Collective Intelligence XXI. Lecture Notes in Computer Science, vol 9630. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49521-6_4
DOI: https://doi.org/10.1007/978-3-662-49521-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49520-9
Online ISBN: 978-3-662-49521-6
eBook Packages: Computer Science (R0)