Big Data-Oriented PaaS Architecture with Disk-as-a-Resource Capability and Container-Based Virtualization

Journal of Grid Computing

Abstract

With the increasing adoption of Big Data technologies as basic tools for the ongoing Digital Transformation, there is a high demand for data-intensive applications. To execute such applications efficiently, cloud providers must change the way hardware infrastructure resources are managed. However, the increasing use of virtualization technologies to achieve efficient usage of infrastructure resources continuously widens the gap between applications and the underlying hardware, thus decreasing resource efficiency for the end user. This scenario is especially troublesome for Big Data applications, as storage is one of the most heavily virtualized resources, which imposes a significant overhead on large-scale data processing. This paper proposes a novel PaaS architecture specifically oriented to Big Data, in which the scheduler offers disks as resources alongside the more common CPU and memory resources, with the aim of providing a better storage solution for the user. Furthermore, virtualization overheads are reduced to the bare minimum by replacing heavy hypervisor-based technologies with operating-system-level virtualization based on lightweight software containers. This architecture has been deployed on a Big Data infrastructure at the CESGA supercomputing center, used as a testbed to compare its performance with OpenStack, a popular private cloud platform. Results show significant performance improvements, reducing the execution time of representative Big Data workloads by up to 4.5×.
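To make the disk-as-a-resource idea concrete, the following is a minimal sketch of a scheduler that treats whole disks as a schedulable resource next to CPU and memory. It is not the scheduler described in the paper: the `Node`, `Request`, and `allocate` names, the first-fit policy, and the device paths are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A physical host. All names and fields here are hypothetical."""
    name: str
    free_cpus: int
    free_mem_gb: int
    free_disks: List[str] = field(default_factory=list)  # e.g. device paths


@dataclass
class Request:
    """Resources requested for one container instance."""
    cpus: int
    mem_gb: int
    disks: int  # number of whole disks wanted


def allocate(nodes: List[Node], req: Request) -> Optional[Dict]:
    """First-fit placement (an assumed policy): return an allocation,
    or None if no node can satisfy CPU, memory, and disks together."""
    for node in nodes:
        if (node.free_cpus >= req.cpus
                and node.free_mem_gb >= req.mem_gb
                and len(node.free_disks) >= req.disks):
            # Grant the disks exclusively and remove them from the free pool.
            granted = node.free_disks[:req.disks]
            node.free_disks = node.free_disks[req.disks:]
            node.free_cpus -= req.cpus
            node.free_mem_gb -= req.mem_gb
            return {"node": node.name, "cpus": req.cpus,
                    "mem_gb": req.mem_gb, "disks": granted}
    return None
```

In this sketch a request is rejected when no single node has enough free disks, even if CPU and memory are available. That is the point of counting disks as a first-class resource: a container is only placed where it can be given its own disks rather than a shared virtualized volume.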

Acknowledgements

This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain (Project TIN2016-75845-P, AEI/FEDER, EU), and by the FPU Program of the Ministry of Education (grant FPU15/03381).

Author information

Correspondence to Jonatan Enes.

Cite this article

Enes, J., Cacheiro, J.L., Expósito, R.R. et al. Big Data-Oriented PaaS Architecture with Disk-as-a-Resource Capability and Container-Based Virtualization. J Grid Computing 16, 587–605 (2018). https://doi.org/10.1007/s10723-018-9460-4

  • DOI: https://doi.org/10.1007/s10723-018-9460-4
