A distributed data exchange engine for polystores

  • 1 Technical University of Berlin, Berlin, Germany
  • 2 Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
Dr. Abdulrahman Kaitoua
  • Corresponding author
  • Technical University of Berlin, Berlin, Germany
  • Abdulrahman Kaitoua is a senior big data architect and team lead in the innovation and research team of GK-Software SE in Berlin, Germany. He received his Ph.D. with honors in Information Technology from Politecnico di Milano in 2017.
Prof. Dr. Tilmann Rabl
  • Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
  • Tilmann Rabl is a full professor and Chair of the Data Engineering Systems Group at Hasso Plattner Institute and the University of Potsdam. He is also cofounder of the startup bankmark.
Prof. Dr. Volker Markl
  • Technical University of Berlin, Berlin, Germany
  • Volker Markl is a Full Professor and Chair of the DIMA Group at TU Berlin and an Adjunct Full Professor at the University of Toronto. He is Director of the Intelligent Analytics for Massive Data Research Group at DFKI and Director of the Berlin Big Data Center.

Abstract

There is increasing interest in fusing data from heterogeneous sources. Combining data sources increases the utility of existing datasets, generates new information, and enables services of higher quality. A central issue in working with heterogeneous sources is data migration: in order to share and process data in different engines, resource-intensive and complex movements and transformations between computing engines, services, and stores are necessary.

Muses is a distributed, high-performance data migration engine that interconnects distributed data stores by forwarding, transforming, repartitioning, or broadcasting data among the instances of distributed engines in a resource-, cost-, and performance-adaptive manner. As such, it performs seamless information sharing across all participating resources in a standard, modular manner. We show an overall execution-time improvement of 30 % for jobs pipelined across multiple engines, even when the overhead of Muses is included. This performance gain implies that Muses can be used to optimise large pipelines that leverage multiple engines.
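
The four data movement modes named above (forwarding, transforming, repartitioning, broadcasting) can be illustrated with a minimal in-memory sketch. This is not the Muses API: the function names, the list-of-lists representation of engine partitions, and the hash-based placement rule are all assumptions made for illustration only.

```python
from typing import Callable, Dict, Hashable, List

# Hypothetical model: each engine instance holds one partition (a list of records).
Record = Dict[str, object]
Partitions = List[List[Record]]


def forward(parts: Partitions) -> Partitions:
    """Send partition i of the source to instance i of the target, unchanged."""
    return [list(p) for p in parts]


def transform(parts: Partitions, fn: Callable[[Record], Record]) -> Partitions:
    """Apply a per-record conversion while keeping the partitioning as it is."""
    return [[fn(r) for r in p] for p in parts]


def repartition(parts: Partitions, key: Callable[[Record], Hashable], n: int) -> Partitions:
    """Redistribute records over n target instances by hashing a key (assumed rule)."""
    out: Partitions = [[] for _ in range(n)]
    for p in parts:
        for r in p:
            out[hash(key(r)) % n].append(r)
    return out


def broadcast(parts: Partitions, n: int) -> Partitions:
    """Give every one of the n target instances a full copy of the data."""
    everything = [r for p in parts for r in p]
    return [list(everything) for _ in range(n)]


# Two source instances feeding a target engine.
source = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
by_id = repartition(source, lambda r: r["id"], 2)
copies = broadcast(source, 3)
```

In this sketch, forwarding and transforming preserve the source layout, repartitioning changes which instance owns each record, and broadcasting multiplies the data volume by the number of target instances, which is why an adaptive choice among the modes affects resource use and cost.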



it - Information Technology is a strictly peer-reviewed scientific journal. It is the oldest German journal in the field of information technology. Today, the major aim of it - Information Technology is highlighting issues on ongoing newsworthy areas in information technology and informatics and their application. It aims at presenting the topics with a holistic view.
