Statistical Data Generation Using Sample Data

Fazekas, Bálint; Kiss, Attila

doi:10.1007/978-3-030-00063-9_4

Part of the book series: Communications in Computer and Information Science (CCIS, volume 909)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

Due to the ever-increasing amount of data stored in databases, it is important to develop software that can generate large volumes of test data reflecting the properties of a given sample. With such data, database algorithms can be stress-tested and their performance evaluated. If the generated data set is much larger than the given sample, the process is called data augmentation or synthetic data generation. Data augmentation can also be very useful in Big Data benchmarking. This paper describes a method for statistical data generation based on a given sample, where the generated result attempts to reflect the statistical properties of the sample as closely as possible. We explain how any given data can be represented numerically and hence clustered using the DBSCAN and K-means algorithms. We introduce a hybrid clustering method that combines these two algorithms and aims to unify their strengths. After the data is clustered, the individual sub-clusters are statistically analyzed, and pseudo-random data are generated based on the results of this analysis. The results obtained with the hybrid clustering algorithm show that artificial data reflecting the statistical properties of any given sample can be created.
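
The pipeline outlined in the abstract (cluster, analyze each sub-cluster, then generate) can be illustrated with a minimal Python sketch. The abstract does not specify how the hybrid method combines DBSCAN and K-means, so the code below makes several assumptions of its own: DBSCAN runs first, K-means then splits each DBSCAN cluster into sub-clusters, noise points are discarded, and each sub-cluster is modelled by an independent normal distribution per dimension. The function names and the parameters `eps`, `min_pts`, `k`, and `n_out` are hypothetical, and the sample is assumed to be already encoded as numeric tuples.

```python
import math
import random
import statistics


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] is not None:
                continue  # already assigned; border points are not expanded
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, keep growing the cluster
    return labels


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's algorithm; returns the non-empty groups of points."""
    centers = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        centers = [
            tuple(statistics.fmean(col) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
    return [g for g in groups if g]


def generate(sample, n_out, eps, min_pts, k, seed=0):
    """Generate roughly n_out synthetic points from the sample.

    Assumption of this sketch: dimensions are treated as independent,
    so correlations inside a sub-cluster are not reproduced.
    """
    rng = random.Random(seed)
    labels = dbscan(sample, eps, min_pts)
    out = []
    for c in sorted(set(labels) - {-1}):  # noise points are dropped
        members = [p for p, lab in zip(sample, labels) if lab == c]
        for sub in kmeans(members, min(k, len(members)), rng):
            # Each sub-cluster contributes in proportion to its size.
            share = round(n_out * len(sub) / len(sample))
            cols = list(zip(*sub))
            mus = [statistics.fmean(col) for col in cols]
            sigmas = [statistics.pstdev(col) for col in cols]
            for _ in range(share):
                out.append(tuple(rng.gauss(m, s) for m, s in zip(mus, sigmas)))
    return out
```

Calling `generate(sample, 1000, eps=1.0, min_pts=3, k=2)` on a sample of 100 two-dimensional points would return about 1000 synthetic points. The count is approximate because shares are rounded and noise points contribute nothing. Replacing the per-dimension normal model with a multivariate one would preserve correlations, at the cost of estimating a covariance matrix for every sub-cluster.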

Dr. Kiss was also with J. Selye University, Komárno, Slovakia.

Acknowledgements

The project was supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002).

Author information

Authors and Affiliations

Department of Information Systems, Faculty of Informatics, ELTE Eötvös Loránd University, Budapest, Hungary
Bálint Fazekas & Attila Kiss

Authors

Bálint Fazekas
Attila Kiss

Corresponding author

Correspondence to Bálint Fazekas.

Editor information

Editors and Affiliations

Eötvös Loránd University, Budapest, Hungary
András Benczúr
Abt. Informatik, Universität Kiel, Kiel, Germany
Bernhard Thalheim
Eötvös Loránd University, Budapest, Hungary
Tomáš Horváth
Politecnico di Torino, Turin, Italy
Silvia Chiusano
Polytechnic University of Turin, Turin, Italy
Tania Cerquitelli
Hungarian Academy of Sciences, Budapest, Hungary
Csaba Sidló
University of Nebraska–Lincoln, Lincoln, NE, USA
Peter Z. Revesz

About this paper

Cite this paper

Fazekas, B., Kiss, A. (2018). Statistical Data Generation Using Sample Data. In: Benczúr, A., et al. New Trends in Databases and Information Systems. ADBIS 2018. Communications in Computer and Information Science, vol 909. Springer, Cham. https://doi.org/10.1007/978-3-030-00063-9_4

DOI: https://doi.org/10.1007/978-3-030-00063-9_4
Published: 31 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00062-2
Online ISBN: 978-3-030-00063-9
eBook Packages: Computer Science, Computer Science (R0)
