Abstract
Py_ape is a package in Python that integrates a number of string and text processing algorithms for collecting, extracting, and cleaning text data from websites, creating frames for text corpora, and matching entities, matching two schemas, mapping and merging two schemas. The functions of Py_ape help the user step-by-step perform data integration and data preparation, based on some popular Python libraries. Especially in the entity matching function of the schema matching and merging phase, we used the Hamming distance algorithm to identify similar string pairs, and the longest common substring similarity algorithm to map data between the columns of schemas. These algorithms help to increase the accuracy of the schema matching process. In addition, in the article, we present experimental results using Py_ape to scrape, clean, match, and merge two sets of data related to aviation crashes, taken from different sources of Kaggle and Wikipedia. The result of the experiment will be evaluated in detail in the rest of the paper.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
References
Chen, C., Golshan, B., Halevy, A., Tan, W.-C., Doan, A.H.: BigGorilla: an open-source ecosystem for data preparation and integration. Comput. Sci. IEEE Data Eng. Bull. (2018)
Doan, A., Halevy, A., Ives, Z.: Principles of Data Integration, 1st edn. Morgan Kaufmann (2012)
Golshan, B., Halevy, A.Y., Mihaila, G.A., Tan, W.: Data integration: after the teenage years. In: PODS (2017)
Miller, R.J.: The future of data integration. In: KDD, p. 3 (2017)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Doan, A., Halevy, A.Y.: Semantic integration research in the database community: a brief survey. AI Mag. 26(1), 83–94 (2005)
Pessig, P.: Entity matching using Magellan - matching drug reference tables. In: CPCP Retreat (2017). http://cpcp.wisc.edu/resources/cpcp-2017-retreat-entity-matching
Mudgal, S., et al.: Deep learning for entity matching: a design space exploration. In: SIGMOD-18 (2018)
Konda, P., et al.: Magellan: toward building entity matching management systems. PVLDB 9(12), 1197–1208 (2016)
Wang, S., Jiang, J.: A compare-aggregate model for matching text sequences. In: ICLR (2017)
Yu, M., et al.: String similarity search and join: a survey. Front. Comput. Sci. 10(3), 399–417 (2016)
Bloor Research International: Self-Service Data Preparation and Cataloguing (2016). https://www.bloorresearch.com/research/self-service-data-preparation-cataloguing/. Accessed 14 May 2018
Heer, J., Hellerstein, J., Kandel, S.: Predictive interaction for data transformation. In: Proceedings of the Conference on Innovative Data Systems Research (CIDR) (2015)
Jin, Z., et al.: Foofah: transforming data by example. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 683–698. ACM (2017)
Kopelowitz, T., Porat, E.: A simple algorithm for approximating the text-to-pattern hamming distance. In: 1st Symposium on Simplicity in Algorithms (SOSA 2018) (2018)
Ho, T., Oh, S., Kim, H.: New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. J. Supercomput. 74, 1815–1834 (2018). https://doi.org/10.1007/s11227-017-2192-6
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. JMLR 12, 2825–2830 (2011)
Bernstein, P.A., Melnik, S.: Metadata management. In: Proceedings of the IEEE CS International Conference on Data Engineering. IEEE Computer Society (2004)
Mittal, S., Nag, S.: A survey of encoding techniques for reducing data-movement energy. J. Syst. Arch. 97, 373–396 (2019)
Apostolico, A., et al.: Sequence similarity measures based on bounded hamming distance. Theoret. Comput. Sci. 638, 76–90 (2016)
Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, pp. 125–128. Cambridge University Press, Cambridge (1999). ISBN 0-521-58519-8
Gomaa, W.H., Fahmy, A.A.: A survey of text similarity approaches. Int. J. Comput. Appl. (0975–8887). 68(13) (2013)
Yu, M., Li, G., Deng, D., Feng, J.: String similarity search and join: a survey. Front. Comput. Sci. 10(3), 399–417 (2015). https://doi.org/10.1007/s11704-015-5900-5
Recruit Holdings Co., Ltd.: Recruit’s Artificial Intelligence Laboratory Releases BigGorilla: An Open-source Data Integration and Data Preparation Ecosystem (2019). https://recruit-holdings.com/news_data/release/2017/0630_7890.html
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Nguyen, BN.T., Phạm, P.N.H., Nguyen, V.T., Viet, P.Q., Tuan, L.D., Snasel, V. (2020). Py_ape: Text Data Acquiring, Extracting, Cleaning and Schema Matching in Python. In: Dang, T.K., Küng, J., Takizawa, M., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2020. Communications in Computer and Information Science, vol 1306. Springer, Singapore. https://doi.org/10.1007/978-981-33-4370-2_6
Download citation
DOI: https://doi.org/10.1007/978-981-33-4370-2_6
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-4369-6
Online ISBN: 978-981-33-4370-2
eBook Packages: Computer ScienceComputer Science (R0)