Abstract
Data is being produced at an unprecedented pace. At the same time, there is an insatiable interest in using such data for use cases that span all imaginable domains, including health, climate, business, and gaming. Beyond the novel socio-technical challenges that surround data-driven innovations, there are still open data processing challenges that impede the usability of data-driven techniques. It is commonly acknowledged that overcoming the syntactic and semantic heterogeneity of data in order to combine various sources for a common goal is a major bottleneck. Furthermore, the quality of such data is always in question, as today's data science pipelines are highly ad hoc and lack the necessary care for provenance. Finally, quality criteria that go beyond the syntactic and semantic correctness of individual values and also incorporate population-level constraints, such as equal parity and opportunity with regard to protected groups, play an increasingly important role in this process. Traditional research on data integration focused on the post-merger integration of companies, where customer or product databases had to be integrated. While this is often hard enough, today the challenges are aggravated because more stakeholders are using data analytics tools to derive domain-specific insights. I call this phenomenon the democratization of data science, a process that is both challenging and necessary. Novel systems need to be user-friendly enough that not only trained database administrators but also stakeholders with less computer science expertise can handle them. Thus, our research focuses on scalable example-driven techniques for data preparation and curation. Furthermore, we believe that it is important to educate society at large on the implications of a data-driven world and to actively promote the concept of data literacy as a fundamental competence.
About the author

Prof. Dr. Ziawasch Abedjan studied IT Systems Engineering at the Hasso Plattner Institute Potsdam (Bachelor 2008, Master 2010, PhD 2014). From 2014 to 2016, he was a postdoc in the database group at MIT with Michael Stonebraker and Samuel Madden. From 2016 to 2020, he was a junior professor at TU Berlin and a Senior Researcher at DFKI. In 2020, he became a full professor at Leibniz University Hannover, chairing the databases and information systems group. Since 2021, he has also been a visiting academic at Amazon. His research is focused on democratizing and optimizing data science workflows and improving data quality for analytical applications. He is a member of the L3S research center and a fellow of the Berlin Institute for Learning and Data. Prof. Dr. Ziawasch Abedjan was honored with a junior fellowship of the Gesellschaft für Informatik e. V.
© 2022 Walter de Gruyter GmbH, Berlin/Boston