Testing of Various Approaches for Semiautomatic Parish Records Word Standardization

  • Conference paper
Semantic Technology (JIST 2019)

Abstract

This paper deals with the clustering of words from parish records, which is an essential step before their standardization. In the past, mostly in the 17th century and at the beginning of the 18th century, names did not have a standard form, so further work requires clusters of words that have the same meaning. In addition, parish records are written in several languages (Czech, German, and Latin), so if we want to express relations between people, occupations, or causes of death in one language, we need to standardize them.

Standardization starts with pre-processing, continues with the comparison of words, and ends with classification into clusters. The most crucial step is the comparison of words. We have tested various approaches, such as Levenshtein distance and its modifications, Q-grams, Jaro-Winkler, and phonetic codings such as Soundex and Double Metaphone. All of these methods have been tested on the types of words that can appear in parish records: first and last names, occupations, villages, and relationships between people. Based on these tests, we have chosen the most suitable methods for clustering the different types of words.
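
The three steps named above (pre-processing, comparison, clustering) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the diacritics-stripping normalization, the greedy single-pass clustering, the function names, and the 0.65 threshold are all assumptions chosen for illustration, and only plain Levenshtein distance is used as the comparison step.

```python
import unicodedata


def preprocess(word: str) -> str:
    """Lowercase and strip diacritics (an assumed normalization, not the paper's)."""
    decomposed = unicodedata.normalize("NFKD", word.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance over insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cluster(words, threshold=0.65):
    """Greedy pass: a word joins the first cluster whose first member is similar enough.

    The threshold is a hypothetical value, not one reported in the paper.
    """
    clusters = []
    for word in words:
        key = preprocess(word)
        for members in clusters:
            if similarity(key, preprocess(members[0])) >= threshold:
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters


if __name__ == "__main__":
    for group in cluster(["Katharina", "Catharina", "Katerina", "Josef", "Joseph"]):
        print(group)
```

A greedy pass like this depends on the input order and on the choice of representative, so it is only a starting point; swapping `similarity` for a Q-gram, Jaro-Winkler, or phonetic comparison would leave the rest of the sketch unchanged.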

This work was supported by TACR No. TL01000130, by BUT project FIT-S-17-4014 and the IT4IXS: IT4Innovations Excellence in Science project (LQ1602).

Author information

Corresponding author

Correspondence to Jaroslav Rozman.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Rozman, J., Hříbek, D., Zbořil, F. (2020). Testing of Various Approaches for Semiautomatic Parish Records Word Standardization. In: Wang, X., Lisi, F., Xiao, G., Botoeva, E. (eds) Semantic Technology. JIST 2019. Communications in Computer and Information Science, vol 1157. Springer, Singapore. https://doi.org/10.1007/978-981-15-3412-6_3

  • DOI: https://doi.org/10.1007/978-981-15-3412-6_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-3411-9

  • Online ISBN: 978-981-15-3412-6

  • eBook Packages: Computer Science, Computer Science (R0)