Generating Synthetic Data to Allow Learning from a Single Exemplar per Class

Ulanova, Liudmila; Hao, Yuan; Keogh, Eamonn

doi:10.1007/978-3-319-11988-5_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8821)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

Recent years have seen an explosion in the volume of historical documents placed online. The individuality of fonts, combined with the degradation suffered by century-old manuscripts, means that optical character recognition (OCR) systems do not work well here. As human transcription is prohibitively expensive, recent efforts have focused on human/computer cooperative transcription: a human annotates a small fraction of a text to provide labeled data for recognition algorithms. Such a system naturally raises the question of how much data the human must label. In this work we show that we can do well even if the human labels only a single instance from each class. We achieve this result using two novel observations: first, we can leverage a recently introduced parameter-free distance measure, improving it by taking into account the “complexity” of the glyphs being compared; second, we can estimate this complexity using synthetic but plausible instances made from the single training instance. We demonstrate the utility of our observations on diverse historical manuscripts.
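
As a rough illustration of the second observation, the sketch below generates synthetic variants of a single labeled glyph and uses them to estimate a per-class "spread" that rescales a nearest-exemplar distance. It is not the authors' method: the Hamming distance stands in for the parameter-free measure mentioned above, and the perturbations (one-pixel shifts and random pixel flips), the division by the estimated spread, and all function names are assumptions made for this example.

```python
import random

def shift(glyph, dx, dy):
    """Translate a binary glyph by (dx, dy), filling vacated pixels with 0."""
    h, w = len(glyph), len(glyph[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = glyph[sy][sx]
    return out

def flip_noise(glyph, rate, rng):
    """Flip each pixel independently with probability `rate`."""
    return [[p ^ 1 if rng.random() < rate else p for p in row] for row in glyph]

def synthesize(glyph, n, rng, max_shift=1, rate=0.02):
    """Make n perturbed copies of one exemplar (assumed perturbation model)."""
    variants = []
    for _ in range(n):
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        variants.append(flip_noise(shift(glyph, dx, dy), rate, rng))
    return variants

def hamming(a, b):
    """Fraction of pixels that differ between two equally sized glyphs."""
    total = len(a) * len(a[0])
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total

def estimate_spread(exemplar, n, rng):
    """Mean distance from the exemplar to its synthetic variants.

    Used here as a stand-in for the glyph's "complexity"; the floor
    avoids division by zero for very stable glyphs.
    """
    variants = synthesize(exemplar, n, rng)
    mean = sum(hamming(exemplar, v) for v in variants) / n
    return max(mean, 1e-6)

def classify(query, exemplars, spreads):
    """Return the label whose exemplar is nearest after spread rescaling."""
    return min(
        exemplars,
        key=lambda label: hamming(query, exemplars[label]) / spreads[label],
    )

# Toy 5x5 glyphs, one labeled exemplar per class (hypothetical data).
EXEMPLARS = {
    "I": [[0, 0, 1, 0, 0] for _ in range(5)],
    "O": [[1, 1, 1, 1, 1]] + [[1, 0, 0, 0, 1] for _ in range(3)] + [[1, 1, 1, 1, 1]],
}
_rng = random.Random(0)
SPREADS = {label: estimate_spread(g, 50, _rng) for label, g in EXEMPLARS.items()}

if __name__ == "__main__":
    noisy_query = synthesize(EXEMPLARS["I"], 1, _rng)[0]
    print(classify(noisy_query, EXEMPLARS, SPREADS))
```

Dividing by the estimated spread means that a class whose synthetic variants drift far from the exemplar tolerates larger raw distances; other ways of combining the two quantities are equally plausible.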

Author information

Authors and Affiliations

Department of Computer Science & Engineering, University of California, Riverside, USA
Liudmila Ulanova, Yuan Hao & Eamonn Keogh

Editor information

Editors and Affiliations

Department of Computer Science, University of São Paulo, São Carlos, Brazil
Agma Juci Machado Traina
University of Sao Paulo at Sao Carlos - USP, Av. do Trabalhador Saocarlense 400, 13566-590, Sao Carlos, Brazil
Caetano Traina Jr.
University of Sao Paulo at Sao Carlos - USP, Av. do Trabalhador Saocarlense 400, 13566-590, Sao Carlos, Brazil
Robson Leonardo Ferreira Cordeiro

About this paper

Cite this paper

Ulanova, L., Hao, Y., Keogh, E. (2014). Generating Synthetic Data to Allow Learning from a Single Exemplar per Class. In: Traina, A.J.M., Traina, C., Cordeiro, R.L.F. (eds) Similarity Search and Applications. SISAP 2014. Lecture Notes in Computer Science, vol 8821. Springer, Cham. https://doi.org/10.1007/978-3-319-11988-5_17

DOI: https://doi.org/10.1007/978-3-319-11988-5_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11987-8
Online ISBN: 978-3-319-11988-5
eBook Packages: Computer Science, Computer Science (R0)
