Py_ape: Text Data Acquiring, Extracting, Cleaning and Schema Matching in Python

Nguyen, Bich-Ngan T.; Phạm, Phuong N. H.; Nguyen, Vu Thanh; Viet, Phan Quoc; Tuan, Le Dinh; Snasel, Vaclav

doi:10.1007/978-981-33-4370-2_6

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1306))

Included in the following conference series:

International Conference on Future Data and Security Engineering

Abstract

Py_ape is a Python package that integrates a number of string and text processing algorithms for collecting, extracting, and cleaning text data from websites, creating frames for text corpora, matching entities, and matching, mapping, and merging two schemas. Building on several popular Python libraries, its functions guide the user step by step through data integration and data preparation. In the entity matching function of the schema matching and merging phase, we use the Hamming distance algorithm to identify similar string pairs and the longest common substring similarity algorithm to map data between the columns of the two schemas. These algorithms help to increase the accuracy of the schema matching process. We also present experimental results from using Py_ape to scrape, clean, match, and merge two data sets on aviation crashes, taken from two different sources, Kaggle and Wikipedia. The results of the experiment are evaluated in detail in the rest of the paper.
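
To make the two string measures named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the Py_ape API: the function names are hypothetical, the Hamming distance is restricted to equal-length strings, and normalising the longest common substring by the length of the longer string is an assumption, since the abstract does not state how the similarity score is scaled.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # The classical definition only covers equal-length strings.
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len = 0   # length of the best match found so far
    best_end = 0   # end index (exclusive) of that match in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_similarity(a: str, b: str) -> float:
    """Longest common substring length over the longer string's length.

    The normalisation is an assumption made for illustration only.
    """
    if not a or not b:
        return 0.0
    return len(longest_common_substring(a, b)) / max(len(a), len(b))


if __name__ == "__main__":
    # Hypothetical column values from two aviation data sets.
    print(hamming_distance("karolin", "kathrin"))
    print(longest_common_substring("Boeing 747-200", "Boeing 747"))
    print(round(lcs_similarity("Boeing 747-200", "Boeing 747"), 3))
```

In a schema matching setting, a score like `lcs_similarity` could be computed between sampled values of two columns and compared against a threshold; the threshold and the sampling strategy would have to be taken from the full paper.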

Author information

Authors and Affiliations

Ho Chi Minh City University of Food Industry, Ho Chi Minh City, Vietnam
Bich-Ngan T. Nguyen, Phuong N. H. Phạm, Vu Thanh Nguyen & Phan Quoc Viet
Long An University of Economics and Industry, Tân An, Vietnam
Le Dinh Tuan
VSB-Technical University of Ostrava, Ostrava, Czech Republic
Vaclav Snasel

Corresponding authors

Correspondence to Bich-Ngan T. Nguyen or Vu Thanh Nguyen.

Editor information

Editors and Affiliations

Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Johannes Kepler University of Linz, Linz, Austria
Josef Küng
Hosei University, Tokyo, Japan
Makoto Takizawa
Sungkyunkwan University, Suwon, Korea (Republic of)
Tai M. Chung

About this paper

Cite this paper

Nguyen, BN.T., Phạm, P.N.H., Nguyen, V.T., Viet, P.Q., Tuan, L.D., Snasel, V. (2020). Py_ape: Text Data Acquiring, Extracting, Cleaning and Schema Matching in Python. In: Dang, T.K., Küng, J., Takizawa, M., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2020. Communications in Computer and Information Science, vol 1306. Springer, Singapore. https://doi.org/10.1007/978-981-33-4370-2_6

DOI: https://doi.org/10.1007/978-981-33-4370-2_6
Published: 19 November 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-4369-6
Online ISBN: 978-981-33-4370-2
eBook Packages: Computer Science, Computer Science (R0)