Published by De Gruyter Oldenbourg, March 4, 2020

A distributed data exchange engine for polystores

Abdulrahman Kaitoua, Tilmann Rabl and Volker Markl

Abstract

There is an increasing interest in fusing data from heterogeneous sources. Combining data sources increases the utility of existing datasets, generating new information and creating services of higher quality. A central issue in working with heterogeneous sources is data migration: in order to share and process data in different engines, resource-intensive and complex movements and transformations between computing engines, services, and stores are necessary.

Muses is a distributed, high-performance data migration engine that is able to interconnect distributed data stores by forwarding, transforming, repartitioning, or broadcasting data among distributed engines’ instances in a resource-, cost-, and performance-adaptive manner. As such, it performs seamless information sharing across all participating resources in a standard, modular manner. We show an overall improvement of 30 % for pipelining jobs across multiple engines, even when we count the overhead of Muses in the execution time. This performance gain implies that Muses can be used to optimise large pipelines that leverage multiple engines.
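The four data movement patterns named above (forwarding, transforming, repartitioning, and broadcasting) can be illustrated with a minimal in-memory sketch. This is not the Muses API: the function names are hypothetical, and plain Python lists stand in for the partitions held by distributed engine instances. It only shows what each pattern does to the layout of the data.

```python
from typing import Any, Callable, List

# A dataset is modelled as a list of partitions, one per engine instance.
# All names below are hypothetical and chosen for illustration only.
Partitions = List[List[Any]]


def forward(parts: Partitions) -> Partitions:
    """Send each source partition unchanged to the target with the same index."""
    return [list(p) for p in parts]


def transform(parts: Partitions, fn: Callable[[Any], Any]) -> Partitions:
    """Apply a record-level conversion while keeping the partitioning."""
    return [[fn(record) for record in p] for p in parts]


def repartition(parts: Partitions, key: Callable[[Any], Any],
                n_targets: int) -> Partitions:
    """Redistribute records over n_targets partitions by hashing a key."""
    out: Partitions = [[] for _ in range(n_targets)]
    for p in parts:
        for record in p:
            out[hash(key(record)) % n_targets].append(record)
    return out


def broadcast(parts: Partitions, n_targets: int) -> Partitions:
    """Give every target instance a full copy of the dataset."""
    everything = [record for p in parts for record in p]
    return [list(everything) for _ in range(n_targets)]


if __name__ == "__main__":
    source = [[1, 2, 3], [4, 5]]  # two source partitions
    print(forward(source))
    print(transform(source, lambda x: x * 10))
    print(repartition(source, key=lambda x: x, n_targets=3))
    print(broadcast(source, n_targets=2))
```

In a real deployment the partitions would live on different machines and the transfer would go over the network, so the choice between these patterns affects both cost and performance; the sketch ignores that and only models the resulting data placement.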

Funding statement: This work has been supported through grants by the German Ministry for Education and Research as BIFOLD (01IS18025A and 01IS18037).

Received: 2019-10-07
Revised: 2020-02-14
Accepted: 2020-02-21
Published Online: 2020-03-04
Published in Print: 2020-05-27

© 2020 Walter de Gruyter GmbH, Berlin/Boston