Testing of Various Approaches for Semiautomatic Parish Records Word Standardization

Rozman, Jaroslav; Hříbek, David; Zbořil, František

doi:10.1007/978-981-15-3412-6_3

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1157)

Included in the following conference series:

Joint International Semantic Technology Conference

Abstract

This paper deals with the clustering of words from parish records, which is an essential step before their standardization. In the past, mostly in the 17th and at the beginning of the 18th century, names did not have a standard form, so for further work it is essential to create clusters of words that have the same meaning. Besides the variation in names, parish records are written in several languages (Czech, German, and Latin), so if we want to express relations between people, occupations, or causes of death in one language, we need to standardize them.

Standardization proceeds in three steps: pre-processing, comparison of words, and classification into clusters. The most crucial step is the comparison of words. We have tested various approaches, such as Levenshtein distance and its modifications, Q-grams, Jaro-Winkler, and phonetic codings like Soundex and Double Metaphone. All of these methods have been tested with the types of words that can appear in parish records: first and last names, occupations, villages, and relationships between people. From these tests, we have chosen the most suitable methods for clustering the different types of words.
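The three steps named in the abstract (pre-processing, comparison, clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the diacritics-stripping pre-processing, the length-normalized Levenshtein similarity, the greedy clustering rule, the 0.7 threshold, and all function names are assumptions chosen only for illustration.

```python
import unicodedata


def preprocess(word: str) -> str:
    """Lowercase and strip diacritics (an assumed pre-processing step)."""
    decomposed = unicodedata.normalize("NFKD", word.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cluster(words, threshold=0.7):
    """Greedy clustering: a word joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster.
    The threshold value is an assumption, not a value from the paper."""
    clusters = []
    for word in words:
        key = preprocess(word)
        for members in clusters:
            if similarity(key, preprocess(members[0])) >= threshold:
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters
```

With these assumptions, `cluster(["Johann", "Johan", "Kateřina", "Katerina"])` groups the two spellings of each name together. A different comparison function (for example a Q-gram or phonetic one) could be swapped in for `similarity` without changing the clustering loop.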

This work was supported by TACR No. TL01000130, by BUT project FIT-S-17-4014 and the IT4IXS: IT4Innovations Excellence in Science project (LQ1602).

Author information

Authors and Affiliations

Brno University of Technology, Brno, Czech Republic
Jaroslav Rozman, David Hříbek & František Zbořil

Corresponding author

Correspondence to Jaroslav Rozman.

Editor information

Editors and Affiliations

Tianjin University, Tianjin, China
Xin Wang
University of Bari Aldo Moro, Bari, Italy
Francesca A. Lisi
Free University of Bozen-Bolzano, Bolzano, Italy
Guohui Xiao
Imperial College London, London, UK
Elena Botoeva

About this paper

Cite this paper

Rozman, J., Hříbek, D., Zbořil, F. (2020). Testing of Various Approaches for Semiautomatic Parish Records Word Standardization. In: Wang, X., Lisi, F., Xiao, G., Botoeva, E. (eds) Semantic Technology. JIST 2019. Communications in Computer and Information Science, vol 1157. Springer, Singapore. https://doi.org/10.1007/978-981-15-3412-6_3

DOI: https://doi.org/10.1007/978-981-15-3412-6_3
Published: 19 February 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3411-9
Online ISBN: 978-981-15-3412-6
eBook Packages: Computer Science, Computer Science (R0)
