Abstract
This paper presents a Tensor-based Data Model (TDM) for polystore systems, meant to address two major, closely related issues in big data analytics architectures: logical data independence and data impedance mismatch. The TDM is an expressive model that subsumes traditional data models, links the different data models of various data stores, and facilitates data transformations through operators with clearly defined semantics. Our contribution is twofold. First, we add the notion of a schema to the tensor mathematical object using typed associative arrays. Second, we define a set of operators to manipulate data through the TDM. To validate our approach, we first show how the TDM fits into a given polystore architecture. We then describe use cases of real analyses using the TDM and its operators in the context of the 2017 French presidential election.
Notes
- 3. Single Instruction Multiple Data.
- 26. The notation | is the restriction applied to sets, \(A|B=A-(A-B)\).
- 27. expr is a logical expression to compare values of \(\varvec{\mathcal {X}}\) to constants. Its form is as follows:
  expr ::= <condition> | <condition> <logical operator> <condition> | \(\lnot \)<condition> | (<condition>)
  <condition> ::= values of \(\varvec{\mathcal {X}}\) (implicit) <comparison operator> constant
  Logical operators are \(\{\wedge , \vee \}\). Comparison operators are \(\{<,\le , =,\ne ,\ge ,>\}\).
- 28. expr compares keys of the dimensions with constants. Its form is the same as for the operator \(\sigma \), except that
  <condition> ::= name of a dimension <comparison operator> constant.
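The set restriction of note 26 and the two kinds of conditions in notes 27 and 28 can be illustrated with a minimal sketch. It is not the authors' implementation. The class `SparseTensor`, its methods `select_values` and `restrict_dimension`, the function `restrict`, the dimension names, and the sample cells are all hypothetical and chosen only for illustration. The sketch assumes a sparse tensor stored as a map from key tuples to values, with one named dimension per key position.

```python
from typing import Callable, Dict, Hashable, Tuple

Key = Tuple[Hashable, ...]


class SparseTensor:
    """Hypothetical sparse tensor: named dimensions plus a map from key tuples to values."""

    def __init__(self, dims: Tuple[str, ...], cells: Dict[Key, float]):
        for key in cells:
            if len(key) != len(dims):
                raise ValueError(f"key {key!r} does not match dimensions {dims!r}")
        self.dims = dims
        self.cells = dict(cells)

    def select_values(self, condition: Callable[[float], bool]) -> "SparseTensor":
        """Keep the cells whose value satisfies the condition (conditions as in note 27)."""
        kept = {k: v for k, v in self.cells.items() if condition(v)}
        return SparseTensor(self.dims, kept)

    def restrict_dimension(
        self, dim: str, condition: Callable[[Hashable], bool]
    ) -> "SparseTensor":
        """Keep the cells whose key on `dim` satisfies the condition (conditions as in note 28)."""
        axis = self.dims.index(dim)
        kept = {k: v for k, v in self.cells.items() if condition(k[axis])}
        return SparseTensor(self.dims, kept)


def restrict(a: set, b: set) -> set:
    """Set restriction of note 26: A|B = A - (A - B)."""
    return a - (a - b)


# Made-up sample data: (user, hashtag, week) -> number of tweets.
tweets = SparseTensor(
    dims=("user", "hashtag", "week"),
    cells={
        ("u1", "#vote", 14): 3.0,
        ("u1", "#debate", 15): 1.0,
        ("u2", "#vote", 15): 7.0,
    },
)

# expr on values: (value >= 2) AND NOT (value > 5)
frequent = tweets.select_values(lambda v: v >= 2 and not v > 5)

# expr on a dimension: week = 15
week_15 = tweets.restrict_dimension("week", lambda w: w == 15)
```

Both operations return a new tensor over the same dimensions, so they can be chained. `restrict(a, b)` returns exactly the elements of `a` that also belong to `b`.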
Acknowledgement
This research was partially supported by the project I-SITE UBFC COCKTAIL. We thank George Becker for comments that have greatly improved the manuscript and Arnaud Da Costa for the maintenance of the server infrastructure.
Copyright information
© 2019 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Leclercq, É., Gillet, A., Grison, T., Savonnet, M. (2019). Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics. In: Hameurlain, A., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XLII. Lecture Notes in Computer Science, vol 11860. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-60531-8_3
DOI: https://doi.org/10.1007/978-3-662-60531-8_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-60530-1
Online ISBN: 978-3-662-60531-8
eBook Packages: Computer Science, Computer Science (R0)