Statistical Data Generation Using Sample Data

  • Conference paper
New Trends in Databases and Information Systems (ADBIS 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 909)

Abstract

Because the volume of data stored in databases keeps growing, it is important to develop software that can generate large amounts of test data reflecting the properties of a given sample. Such data allow database algorithms to be stress-tested and their performance to be evaluated. If the generated data set is much larger than the given sample, the process is called data augmentation or synthetic data generation. Data augmentation can also be very useful in Big Data benchmarking. This paper describes a method for statistical data generation based on a given sample, in which the generated result reflects the statistical properties of the sample as closely as possible. We explain how any given data can be represented numerically and hence clustered using the DBSCAN and K-means algorithms. We introduce a hybrid clustering method that combines the two algorithms and aims to unify their strengths. After the data are clustered, the individual sub-clusters are analyzed statistically, and pseudo-random data are generated from the results of this analysis. The results obtained with the hybrid clustering algorithm show that artificial data reflecting the statistical properties of any given sample can be created.
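The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not state how DBSCAN and K-means are combined, so the code below rests on assumptions: DBSCAN is run first to find density-based clusters, K-means then splits each cluster into sub-clusters, and each sub-cluster is summarized by a per-dimension mean and standard deviation from which Gaussian values are drawn. The function names (`dbscan`, `kmeans`, `generate`), the parameters (`eps`, `min_pts`, `k`), the Gaussian model, and the choice to discard noise points are all hypothetical and are not taken from the paper. The sample is assumed to be already encoded as numeric tuples.

```python
import math
import random
import statistics


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


def kmeans(points, k, rng, iters=20):
    """Minimal Lloyd's K-means: returns the non-empty groups of points."""
    centers = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        centers = [
            tuple(statistics.fmean(col) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
    return [g for g in groups if g]


def generate(sample, n, eps=1.0, min_pts=3, k=2, seed=0):
    """Cluster the sample, then draw n synthetic points from sub-cluster statistics."""
    rng = random.Random(seed)
    labels = dbscan(sample, eps, min_pts)
    subclusters = []
    for c in sorted(set(labels) - {-1}):  # assumption: noise points are discarded
        members = [p for p, label in zip(sample, labels) if label == c]
        subclusters.extend(kmeans(members, min(k, len(members)), rng))
    if not subclusters:
        raise ValueError("DBSCAN found no clusters; adjust eps or min_pts")
    # Per-dimension (mean, standard deviation) for every sub-cluster.
    stats = [
        [(statistics.fmean(col), statistics.pstdev(col)) for col in zip(*g)]
        for g in subclusters
    ]
    weights = [len(g) for g in subclusters]  # larger sub-clusters are sampled more often
    synthetic = []
    for _ in range(n):
        chosen = rng.choices(stats, weights=weights)[0]
        synthetic.append(tuple(rng.gauss(mean, sd) for mean, sd in chosen))
    return synthetic


if __name__ == "__main__":
    demo_rng = random.Random(1)
    demo_sample = [(demo_rng.gauss(0, 0.3), demo_rng.gauss(0, 0.3)) for _ in range(40)]
    demo_sample += [(demo_rng.gauss(10, 0.3), demo_rng.gauss(10, 0.3)) for _ in range(40)]
    print(generate(demo_sample, 3))
```

Sampling sub-clusters in proportion to their size is one simple way to keep the relative densities of the sample in the generated data; a per-dimension Gaussian ignores correlations between dimensions, so a richer model may be needed where those matter.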

Dr. Kiss was also with J. Selye University, Komárno, Slovakia.

Acknowledgements

The project was supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002).

Author information

Corresponding author

Correspondence to Bálint Fazekas.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Fazekas, B., Kiss, A. (2018). Statistical Data Generation Using Sample Data. In: Benczúr, A., et al. New Trends in Databases and Information Systems. ADBIS 2018. Communications in Computer and Information Science, vol 909. Springer, Cham. https://doi.org/10.1007/978-3-030-00063-9_4

  • DOI: https://doi.org/10.1007/978-3-030-00063-9_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00062-2

  • Online ISBN: 978-3-030-00063-9

  • eBook Packages: Computer Science, Computer Science (R0)
