In recent years, Machine Learning (ML) and Data Science (DS) have grown into significant fields. They impact many other disciplines and enable numerous technological innovations. ML, which we use as an umbrella term for other data-centric disciplines like data mining, knowledge discovery and DS, relies mostly on quantitative, empirically driven research. This is comparable to other computer science fields such as information retrieval. Similar to information retrieval, ML research faces challenges with replicability, reproducibility and presenting complex empirical results in a suitable way within traditional, paper-based formats. Although information retrieval research has a long-standing record of conducting systematic evaluations, researchers have shown that improvements do not add up as reported in the literature. These shortcomings are reflected in the development of standardised benchmark suites like MLPerf1 as well as in the rise of evaluation challenges like TREC2 or CLEF.3 We suspect that finding efficient baseline solutions for the defined challenges is often hindered by difficulties in solving the aforementioned challenges.
Similar problems hold true for ML, leading to what some authors call a reproducibility crisis. However, the challenges go beyond reproducibility. For example, even if algorithms are open source and therefore freely available, the lack of high-performance computing infrastructure as well as the lack of datasets hinders researchers from reproducing experimental results and comparing them with their own. Similarly, results presented in research papers often lack details due to space constraints. For example, the Dagstuhl seminar on reproducibility of data-oriented experiments in e-Science stated that computational experiments still tend to be published mostly informally in papers. This kind of publishing hinders research by impeding reusability, extendability, and replicability. Furthermore, it endangers the reliability of published results by requiring unverifiable trust in incomplete information. A set of mechanisms needs to be employed in the scientific landscape to develop a standard and minimum common understanding for ensuring reproducible experiments.
We argue that solutions to these challenges require an open, transparent and tool-supported approach to developing, sharing and communicating achieved research results. Thereby, such solutions need to go beyond traditional paper-based reporting. While some tools and platforms for doing so already exist, it remains unclear how they contribute to an open research process and whether those tools and platforms satisfy all requirements for reproducible and replicable experiments. In this paper, we try to address the requirements for open, transparent and reproducible ML research by developing an Open Science process model for ML. Furthermore, we compare existing platforms according to the developed process. While strictly speaking, some properties like reproducibility, replicability, comparability, and understandability are inherent to any scientific process, we consider them essential for Open Science and therefore use the term “Open Science” in a broader sense. From our point of view, ML research needs to be more accessible when considering Open Science principles for delivering true advancements. Experiments need to report micro and macro decisions as well as documented parameter and hardware settings. Evaluation measures need to go beyond aggregated statistics, allowing researchers to explore and compare experiments with low effort. Consequently, we review the contribution of a number of openly available tools to a possible open and transparent research process.
2 Core contributions
We make two core contributions in this article, namely:
We analyse existing challenges associated with ML and DS research and develop an Open Science Process Model for Machine Learning (OSPMML). The model centres around the concept of Structured Results, which aggregate research results and provide the necessary structures for proper tool support. While process models for research in other fields, like the Design Science Research Methodology (DSRM) process model, provide a good definition only for their respective field, general Open Science models that give an extensive abstract frame, like the lifecycle of the Open Science Framework (OSF), tend to omit details of ML itself. Employing them in the context of ML can, therefore, be impractical. With the OSPMML we emphasise the particular requirements of ML and DS research. In contrast to existing ML process models like the KDD-Process and the CRISP-DM model, we focus on Open Science principles like sharing research artefacts and align general scientific principles like replicability, reproducibility, comparability, and understandability to the challenges observed in ML research. In our opinion, research artefacts like publications, datasets, repositories, and operating software, as well as relations between research artefacts, need to be identifiable and persisted. This should at best include semantic information about the usage, provenance and structure of their content to facilitate replicability.
As a second main contribution, we review existing platforms towards the requirements imposed by our OSPMML. Subsequently, we identify blank spots for supporting a fully Open Science approach with efficient and reproducible knowledge sharing. We argue, that explicit formalisation of knowledge, as well as standardisation of experimental results is necessary for improving the current ML research cycle and the scientific quality in the field.
3 Related work
We now introduce related models and surveys. The lifecycle of the OSF repository4 envelops the whole process of scientific research. It embeds specialised solutions like gitlab5 for version control and figshare6 for result communication in a platform capable of managing project structures. While the OSF repository itself can reduce the danger of information gaps, the execution of experiments and the respective data collection are abstracted away.
The CRISP-DM model defines the following six phases for data mining: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. It was developed with contributions from the EU and companies like Daimler Chrysler and includes an integration into business decision processes to enable companies to define projects based on the model. Considering scientific research, it falls short of securing Open Science principles like accessibility, openness, and transparency. Phases like Business Understanding and Deployment may be interpreted in a scientific context containing, for example, a state-of-the-art analysis or the publication of results, but deliver no framework for the respective processes.
Similarly, the KDD-Process Model describes the data mining process from data selection in databases to extraction of knowledge, while omitting the management and distribution of knowledge. Main phases of KDD include Data Cleaning, Data Integration, Data Selection, Data Transformation, Data Mining, and Data Interpretation.
In 2016 the Schloss Dagstuhl institute hosted a seminar concerning the reproducibility of experimental research in computer science. The seminar described a set of tools in an overview with a short introduction. The released report on reproducibility of data-oriented experiments in e-Science concluded that computational experiments still tend to be published informally in papers. This isolates results and hinders research by impeding reusability, extendability, and replicability, and therefore endangers reliability by requiring unverifiable trust in incomplete information. A set of mechanisms needs to be employed in the scientific landscape to instigate a standard and minimum grade for reproducible experiments. The state of HCI (Human-Computer Interaction) in the context of ML was analysed with a UX (user experience) design survey by Graham Dove et al. Fifty-one respondents, including practitioners, were questioned on their views of current developments and challenges regarding the usage of ML technology. The authors concluded that the usage of ML technology is still in its infancy. To enable efficient employment of ML in design practice, a set of challenges needs to be addressed. We derive a more experimental, research-oriented view of ML challenges in Section 4.2.
4 An open science process model for machine learning
We start developing our Open Science process model by analysing challenges and pitfalls in existing ML and DS process models. In particular, the KDD-Process and the CRISP-DM model deliver often-cited process models for DS, as described by Gregory Piatetsky. After identifying aspects that classical ML process models fail to consider, we mainly focus on research collaborations and their demands. This focus is motivated by the basic reasons for collaboration found in expertise sharing, productivity increase, and quality control, as presented by Benedikt Fecher and Sascha Friesike and discussed in conference reports.
After identifying challenges, we determine Open Science properties, namely replicating and reproducing ML experiments as well as comparing and understanding results. Additionally, we deduce requirements and steps for an Open Science process model. We argue that a single tool chain, centred around structured results and formalised claims, should ideally support all these tasks. These requirements allow a subsequent categorisation and analysis of existing (open source) tools according to their ability to enable such a tool chain.
4.1 The core machine learning workflow
At its core, ML and all related disciplines follow a strongly data-driven approach. This approach can be summarised in five core steps as depicted in Figure 1, namely data selection, data preprocessing, model development, evaluation, and analysis.
Data Selection and Cleaning: An ML task usually starts from given data and selects task-relevant data, usually termed datasets, in this first step. Data selection is guided by the available data, the task itself, and, in a research setting, by the experimental protocol.
Data Preprocessing maps the selected data into a formal, usually mathematical representation (e. g., vector space models, tensors, or graphs) usable by ML algorithms. Feature engineering aims to find an optimal mapping and can be seen as a crucial step for good model performance.
Model Development refers to the development of suitable ML and statistical models over the preprocessed data. It usually involves programming efforts when new algorithms and models are developed, or merely the application of existing algorithms.
Evaluation applies a strict experimental setup for evaluating the performance of the model. This includes splitting datasets into development, training, evaluation, and test sets, defining meaningful evaluation measures, as well as finding optimal hyper-parameter settings and optimisation strategies.
Analysis interprets the obtained evaluation results and deduces properties of the dataset and model. It can be seen as a knowledge formation or respectively knowledge creation process synthesising the results. This process also requires the comparison of different runs and models from potentially different communities.
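The five steps above can be sketched in a minimal, library-free way; the toy spam task, the data, and the trivial threshold "model" below are all purely illustrative stand-ins for real selection, preprocessing, modelling, evaluation, and analysis work.

```python
# 1. Data selection: pick task-relevant records from the available data.
raw = [("spam offer now", 1), ("meeting at noon", 0),
       ("cheap spam deal", 1), ("lunch tomorrow?", 0),
       ("spam spam spam", 1), ("project update", 0)]

# 2. Data preprocessing: map each text to a formal representation
#    (here a single hand-crafted feature: the count of the token "spam").
def featurize(text):
    return [text.split().count("spam")]

X = [featurize(text) for text, _ in raw]
y = [label for _, label in raw]

# 3. Model development: pick the decision threshold that best separates
#    the training labels (a stand-in for real model fitting).
def train(X_train, y_train):
    return max(range(3), key=lambda t: sum(
        (x[0] >= t) == bool(lbl) for x, lbl in zip(X_train, y_train)))

# 4. Evaluation: hold out a test set and measure accuracy.
X_train, y_train, X_test, y_test = X[:4], y[:4], X[4:], y[4:]
threshold = train(X_train, y_train)
preds = [int(x[0] >= threshold) for x in X_test]
accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)

# 5. Analysis: interpret the outcome; note that preserving the individual
#    decisions (preds), not just the aggregate, is what enables comparison.
print({"threshold": threshold, "accuracy": accuracy, "decisions": preds})
```

Even in this toy form, each step makes decisions (feature choice, split, threshold range) that a traditional paper would typically omit.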
4.2 Pitfalls and challenges of the core machine learning workflow
In order to understand the necessary Open Science requirements for ML research, we need to analyse potential pitfalls and challenges of the ML process. When conducting ML research, each step of the core ML workflow outlined above requires precise execution. In our opinion, the research process as such has its own challenges differing from the typical challenges of ML application, like scalability or ease of use, as described by Robert Grossman et al. However, challenges can overlap. From our point of view, we can distinguish five challenges particularly tied to ML research, namely Error Management, Technical Complexity, Experiment Planning, Resources and Scalability, and Result Communication. These challenges result from a combination of model analysis, practical experience, and literature research in the scientific context. None of them can be seen as the solitary core challenge, given that disregarding even a single one can jeopardise the quality of research - especially as the list of pitfalls is not to be considered complete. G. Peter Zhang gives a more technical perspective on pitfalls of artificial neural networks, including a fine-grained look into the task we summarised as Error Management. Nevertheless, it has to be noted that these challenges are not only of a technical nature and cannot be solved by tools alone. Further analysis may need to differentiate regarding the underlying rationale of needed skillsets, infrastructure, or general conditions. A closer look at human factors is not given here.
Error Management: Error Management involves the task of detecting and tracing errors in an ML experiment. Contrary to other research fields in and outside of Computer Science, finding errors in ML experiments can be particularly challenging. The outcome of an experiment strongly depends on the chosen hyper-parameters, dataset preparation, and feature engineering. Even wrong approaches can provide good results when the experimental setup contains errors. For example, including ground-truth features in the experimental setup could yield a well-performing model on the test data that fails badly in real-world applications. Similarly, preprocessing errors like neglecting to normalise input data might still provide usable results, but not the best possible model. Due to the technical complexity outlined below, finding such errors can be tedious and time-consuming.
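The ground-truth leakage pitfall can be illustrated with a small, purely hypothetical sketch: a "leaked" feature that is accidentally a copy of the label makes a broken model look perfect on held-out test data, while any honest model must work from the noisy genuine feature alone.

```python
import random

random.seed(0)  # fixed seed so the illustration is deterministic

# Hypothetical dataset: each row is (genuine_feature, leaked_feature, label),
# where the leaked feature is, by mistake, just a copy of the label.
labels = [random.randint(0, 1) for _ in range(200)]
data = [(random.gauss(lbl, 1.0), lbl, lbl) for lbl in labels]
train_rows, test_rows = data[:150], data[150:]

def accuracy(predict, rows):
    return sum(predict(row) == row[2] for row in rows) / len(rows)

# A "model" relying on the leaked feature looks perfect on the test set...
def leaky_model(row):
    return row[1]

# ...while an honest model must use the noisy genuine feature alone.
def honest_model(row):
    return int(row[0] > 0.5)

print("with leakage:   ", accuracy(leaky_model, test_rows))   # 1.0
print("without leakage:", accuracy(honest_model, test_rows))  # below 1.0
```

Nothing in the aggregated test score of the leaky model reveals the error; only inspecting the setup (or deploying the model on data without the leaked column) exposes it.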
Technical Complexity: Conducting research in ML and DS involves the challenge of managing the associated technical complexity resulting from (i) the plethora of available programming languages, libraries, frameworks, and ecosystems (e. g. Python, R, Spark, Azure, TensorFlow, PyTorch, Weka etc.) and (ii) the often underestimated complexity of potential ML solutions. The UX design analysis conducted by Graham Dove et al. identifies technical complexity in ML as a core problem for the usability of ML technologies. In our view, this not only applies to employing ML algorithms in practice, but also translates to understanding and developing new solutions. The challenge of technical complexity already has to be tackled fundamentally during the research process. For example, non-trivial ML research involves managing large datasets, domain-specific feature engineering, algorithm engineering, as well as distributed computing. Although available frameworks like TensorFlow, Spark etc. provide out-of-the-box solutions for handling this technical complexity, practitioners often use them as black boxes in a trial-and-error manner. However, a limited understanding of the underlying processes, their parameters, and properties yields unsatisfying and potentially wrong results.
Experiment Planning: As in every scientific field, experiments for ML need rigorous planning, a precise experimental setup and a clear research hypothesis that should be answered. This also holds true for DS projects as described in the Business Understanding phase of the CRISP-DM model. However, the availability of software frameworks, technology stacks, and corresponding tutorials, as well as the ease of setting up and running experiments, facilitates a quick-and-dirty approach. Experiments can be assembled ad-hoc without much planning, making them error-prone. While some results can be achieved, the likelihood of undetected errors - as argued above - increases. This may also lead to a publication bias enabled by a high throughput of experiment execution and result evaluation. Furthermore, such often insufficiently documented experiments impact comparability, reproducibility, and understandability negatively. These problems may be tackled by conscientious work ethic and employing software engineering practices as debated by peers.7
Resources and Scalability: ML research depends significantly on the available resources like datasets, corresponding ground truth labelling and, more recently, the availability of pre-trained models. Especially for the processing of larger input data and increasingly complex models like Neural Networks trained on large image datasets, individual differences in access to computing resources limit the openness of ML research. As an effect, research results cannot be reproduced by everybody or can result in differing findings and claims.
Result Communication: Communicating ML results usually consists of reporting the experimental setup and aggregated measures on the achieved results (e. g. accuracy, precision, recall) in a traditional research paper. However, traditional research papers are not capable of communicating the whole technical complexity and providing all the necessary resources and experimental details to repeat the experiment or find errors in the results. Odd Erik Gundersen et al. concluded that reproducibility at top AI conferences is a relevant issue, as current documentation practices render most of the results irreproducible. Hence, ML experiments are often hard to reproduce, compare, and understand. So while the aforementioned four challenges influence the validity of ML experiments, the inability to communicate these details via traditional means is a clear threat to the scientific process and requires a more transparent and open approach, as recently discussed in the community.8
4.3 Open science in machine learning research
The aforementioned ML pitfalls and challenges, in combination with today’s paper-based research workflow, make it inefficient to realise core scientific principles. In particular, papers cannot provide replicability, reproducibility, comparability, understandability, and completeness, but these have to be supported to facilitate an Open Science environment. In the scientific data management of Open Science, the FAIR Guiding Principles have to be employed to enhance the infrastructure supporting the reuse of scholarly data: data has to be findable, accessible, interoperable, and reusable. Replicability and, respectively, reproducibility are currently a hot topic in science. In the bigger picture, we argue that to secure the interoperability and reusability of ML research, we also need to consider comparability, understandability, and completeness.
Replicability: We understand replicability as the exact reproduction of the original experiment as defined by Drummond. In other words, reproduction is executing the code again on the exact same dataset with the same environment. Although it seems to be a simple requirement, ML experiments are hard to replicate due to technical complexity and limitations in resource or artefact availability. Often, corresponding datasets are not provided, or they are used in a way not conveyed in the experimental setup. The simple case of a missing random seed can drastically change the complete outcome of an algorithm’s execution. While varying parameters like random seeds may be desirable for reproducibility, it impedes the accurate replication of experimental outcomes that utilise pseudorandom components. Furthermore, non-trivial training models usually require large-scale infrastructures unavailable to most researchers and reviewers. Consequently, research is judged on aggregated results with partial knowledge of the experimental setup, which does not allow solid, scientific quality control. Mixing in the often existing publish-or-perish mentality in academia may result in an undue incentive to polish results or imprecise proceedings for the sake of impact.
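The seed issue can be illustrated with a minimal sketch; the `run_experiment` function is a hypothetical stand-in for any training run with pseudorandom components such as weight initialisation or data shuffling.

```python
import random

def run_experiment(seed=None):
    """Toy stand-in for a training run with pseudorandom components
    (e.g. weight initialisation and data shuffling)."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(5)]
    return sum(weights)

# Without a recorded seed, two nominally identical runs diverge:
print(run_experiment() == run_experiment())                # almost certainly False

# Recording the seed alongside the results makes the run replicable:
print(run_experiment(seed=42) == run_experiment(seed=42))  # True
```

Persisting the seed with the other experimental settings is thus a small but necessary part of any structured result.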
Reproducibility: Following Drummond’s definition, reproducibility provides evidence to support claims made in a research paper, which might require several connatural experiments examining and developing argumentations based on a variety of aspects. This definition clearly extends replicability but suffers from the same problems as replicability when it comes to ML research. Also, reproducibility is limited by the chosen aggregated measures and the interpretation by the author. Since a research paper cannot contain all experimental details and achieved results, readers need to rely on the result interpretation of the authors, without being able to take on other perspectives or viewpoints unless they replicate the experiment itself. Consequently, reproducibility requires the ability to compare experiments, potentially across different research groups, and to improve the understanding obtained from single experiments, ideally without replicating them. We call these two properties Comparability and Understandability.
Comparability: We define comparability as the ability to compare two or more experiments. While replicability enables comparability in theory, it is a quite resource-intensive process and, as outlined above, not always easy to achieve. Also, comparing two ML experiments based on aggregated numbers is insufficient due to the lack of details. Two experiments could achieve similar accuracy values while ruling quite differently on individual dataset entries. Similarly, experiments with different dataset splits or different aggregated measures are not comparable in detail either. Missing comparability is particularly crucial during a peer review process, as it limits the ability to compare new claims to the existing state-of-the-art.
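A small illustrative sketch of this point: two hypothetical classifiers with identical aggregated accuracy can still disagree on a large share of individual entries, which only the preserved per-entry decisions reveal.

```python
# Two hypothetical classifiers evaluated on the same ten-entry test set.
truth   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
model_a = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]   # wrong on entries 3 and 9
model_b = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]   # wrong on entries 0 and 5

def acc(pred):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# Aggregated accuracy makes the two models look interchangeable...
print(acc(model_a), acc(model_b))            # 0.8 0.8

# ...yet they disagree on four of the ten individual entries.
disagreements = [i for i, (p, q) in enumerate(zip(model_a, model_b)) if p != q]
print(disagreements)                         # [0, 3, 5, 9]
```

Comparing only the two 0.8 scores would wrongly suggest the models behave the same; the disagreement list shows they err on disjoint entries.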
Understandability: Understandability can be seen as the ability to draw conclusions and formalise claims from ML experiments. This is especially important for reviewing past claims when new evidence is obtained. While understandability is usually given for single research papers, it does not easily extend beyond single experiments and across research groups due to the lack of replicability, reproducibility, and comparability. It also requires the possibility to express claims in a more formal way, which is often not the case.
Completeness: The capacity to survey the whole state of the research field can be called Completeness. This includes deriving the current state-of-the-art and therefore overcoming the problem of hidden baselines. Completeness is hard to achieve since it relies on Comparability and Understandability as well as on the extensiveness of the accessible information.
In order to achieve all five principles and to provide a complete and open body of knowledge in ML research, we need a more transparent, technologically supported and open research process on top of the core ML workflow defined above. In the following, we define an Open Science Process Model for ML Research and discuss its properties (see Figure 1). In order to achieve a higher degree of transparency, we not only need to archive the executable software, but also the achieved experimental results for every decision made by the ML systems. For example, in classification experiments conducted over different splits, we need to preserve every individual decision on the individual splits along with the model and the data split itself. Given such transparent monitoring of results, we can achieve reproducibility and comparability of experiments.
Beyond the transparency of results, we also require a formalised knowledge base providing formal descriptions of the experiments and the supported claims. While archiving the executable software supports replicability, it does not necessarily provide comparability without making the experimental setup explicit in a comparable manner. For example, implementations of ML algorithms can impact the outcomes of analyses even for applications as simple as calculating the standard deviation. Hence, an implementation-independent format is required in order to define experimental setups, the hyper-parameters used, as well as the claims made. We refer to these results as Structured Results, which include not only the executable software and the obtained detailed results, but also a formal description of parameters, datasets and, optionally, the claims made.
Archived experiments and the formal descriptions constitute the key requirement for improving Communication, Quality Control, State-of-the-Art Analysis and also Resource Management - the four steps we defined in Figure 1 to extend the core ML process with Open Science capabilities. From our point of view, ML research needs to utilise online platforms in order to communicate in-depth results and enable readers to explore experimental results in detail.
We refer to this first step as Structured Communication, which includes harmonising experimental results over different ML experiments and the different underlying software frameworks used. Results must be comparable in detail, which requires well-defined metadata schemas and formal languages. Ideally, results and claims can be published in a decentralised manner by individual research groups, while being machine-readable and consumable. These requirements hint towards the use of Linked Data and Semantic Web technology. From our point of view, such structured, machine-readable results should be part of every publication and be considered a research artefact in itself.
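As a purely illustrative example (the schema and every field name are assumptions, not a proposed standard), a machine-readable structured result could be serialised as plainly as this:

```python
import json

# A hypothetical record of a single experiment run; note that it carries
# the individual decisions, which aggregated measures alone would hide.
structured_result = {
    "experiment": "text-classification-baseline",
    "dataset": {"name": "example-corpus", "split": "test", "seed": 42},
    "hyperparameters": {"learning_rate": 0.01, "epochs": 10},
    "environment": {"framework": "example-lib 1.2", "hardware": "1x GPU"},
    "measures": {"accuracy": 0.8},
    "decisions": [{"entry": 0, "predicted": 1, "truth": 1},
                  {"entry": 1, "predicted": 0, "truth": 0}],
    "claim": "feature X improves accuracy over the baseline",
}

serialized = json.dumps(structured_result, indent=2)
print(serialized)
```

A real schema would of course be richer and, following the Linked Data direction above, expressed with shared vocabularies rather than ad-hoc keys; the sketch only shows that experiment, environment, measures, decisions, and claims can live in one machine-readable artefact.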
Structured Communication is the prerequisite for Tool-supported Quality Control. Due to the challenges of Replicability, Reproducibility, Comparability and Understandability, traditional peer-reviewing is severely limited in the case of ML. Experimental results often lack comparability due to missing or unoptimised baselines, different dataset splits, different evaluation measures, and incomplete comparisons caused by space constraints or other differences between the works. The technical complexity of experiments cannot be described to its full extent within a research paper, and not every experiment can be reproduced by the community even if the software is available. Hence, the peer-reviewing process needs to be extended in order to fulfil its quality control role. Given Structured Communication of results, reviewers can validate empirical results in a more complete manner without the need to reproduce the results or to guess at assumptions taken in the experiments. Furthermore, such an enhanced peer-reviewing process would also be a quality check for the experiments conducted, and hence experimental results could be assumed to be “correct” afterward.
Tool-supported Quality Control and Structured Communication, if done openly, also support an enhanced State-of-the-Art Analysis. With structured results, researchers can compare their own experiments with existing work without the need to reproduce the previous state-of-the-art. Subsequently, Structured Results also provide the necessary background for meta-studies, as is well known from other research fields like Information Retrieval.
Finally, Structured Results also enhance Resource Management and ease the setup of new experiments. Datasets can be re-used as before, and new parameter studies can be conducted conveniently by other research groups. In a high-end vision, ML research contributes to one, potentially decentralised, knowledge base comparable to research in genetics. To achieve this vision, the legal circumstances have to be considered. While for some purposes anonymisation of data entries might be enough to abide by privacy laws, procedures for securing metrics like k-anonymity can affect the performance of ML processes. Finding the sweet spot between performance, privacy preservation, and legal possibilities is non-trivial. Additionally, a missing incentive to release data and inherent competitiveness can discourage open publication. This often results in research constricted by non-disclosure agreements. Such practices unfortunately contradict, or at least hinder, all three dimensions of Open Science, namely participation, access, and transparency. Overcoming legal constraints is a problem in itself, as presented by Patrick Andreoli-Versbach and Frank Mueller-Langer, and will take additional work.
5 Platforms for open science in machine learning
We now present existing frameworks with regard to their level of support for the key steps defined in our Open Science Process Model for ML research and the degree of transparency they offer for experimental outcomes. In our preselection, we distinguished between tools (small, special-purpose utilities), platforms covering ML research processes, and standards for structured exchange. A list of the surveyed software can be found on Zenodo.9
We focus on comparing integral platforms in order to identify an encompassing solution for the ML research process. Therefore, we omit presenting tools like CARE, CDE, or ReproZip, which are constrained to specific subtasks. While we surveyed 40 tools, platforms, and standards, we have only included the 11 most central platforms according to our analysis. Selection criteria include the prominence of tools measured by, e. g., GitHub stars, search trends, and mentions in publications, as well as our own practical experience, attribution to ML, and the questioning of peer ML experts. Some recent or fringe solutions were additionally included, even though they might be relatively unknown, to enable a more comprehensive overview of platforms. Limited resources prevented a sound analysis of all eligible tools, platforms, and standards. Thus, this list as well as its interpretation should not be seen as fixed or complete. A detailed systematic analysis is advised for further investigation. The next paragraphs introduce the platforms we identify as central.
DeepDIVA DeepDIVA is an open-source Python framework for making deep learning experiments reproducible. Internally, it uses PyTorch for executing the experiments. DeepDIVA offers a range of boilerplate code for image processing, can document the execution of experiments, and conducts hyper-parameter optimisations. With the help of TensorBoard, it also enables the visualisation of data and results.
FBLearner Flow FBLearner is a commercial ML platform enabling the definition of workflows for ML experiments. Model training can be conducted via managed pipelines. Each step in a workflow is interpreted as an abstract component with well-defined inputs and outputs. All runs can be documented with their artefacts and metrics, as well as compared with each other. ActiVis, as part of Facebook’s ML platform, is an interactive visualisation system developed to support the visual analysis of deep learning experiments. ActiVis is claimed to be the first tool simultaneously supporting instance- and subset-level inspections.
Kepler The Java-based scientific workflow application Kepler was designed to facilitate model creation, execution, and sharing, as well as analyses in research. The platform is supposed to support not only computer science but a range of disciplines. Kepler enables the user to create a workflow by using a graphical user interface to arrange data sources and analytical components of compiled code in sequences. This converges in an executable workflow producing result data.
OpenML OpenML is an open source platform for sharing and organising ML source data and experiments. OpenML tries to simplify access to ML data, software, and results. By means of a web API, it provides an interface to find and reuse the results of ML model applications and to retrieve datasets, their metadata, and application code. A further goal of OpenML is to enable large-scale collaboration in real time.
Polyaxon Polyaxon10 is a newly released platform for executing and managing ML experiments as well as their results. It runs on top of a Kubernetes cluster11 to provide the means to conduct experiments.
ROHub ROHub is a platform storing Research Objects (ROs), which are meant to aggregate and store digital artefacts and their corresponding context semantically. Using ROHub, researchers can share their findings in a semantically explicit way.
Sacred Sacred  is an open-source Python framework which claims to facilitate experiment implementation, organisation, and execution in a comparatively non-intrusive way. Using code annotations and predefined simple structures, Sacred imposes structure on Python scripts from within. Additionally, the configuration can be inspected during a run and can affect the script's execution. Sacred's command line interface provides a configuration system that allows parameters of the script execution to be altered. To ensure reproducibility, Sacred can log a sequence of results and settings to the local file system or to MongoDB.
Sumatra Like Sacred, Sumatra  is a Python library with a Command Line Interface (CLI) that tracks ML projects, but in contrast it represents a more enclosed system. When executing Python scripts, it can use annotations to track internal states of the application. Sumatra can additionally be used to track the input and output of non-Python programs; for this, it requires prior integration of dependencies in the relevant programming language. To track experiment data, users need to configure input and output files. Sumatra facilitates replicability and reproducibility, but in full only on the original environment of the first execution.
Tira TIRA (Testbed for Information Retrieval Algorithms)  is a web framework implemented to provide the means for research evaluation. TIRA was not developed primarily for tracking ML research; rather, it aims to provide a general platform for plagiarism detection and scientific collaboration.
VisTrails VisTrails  manages scientific workflows. Its main objective is to improve the scientific discovery process by offering visual aids. Additionally, VisTrails experiments can be shared and their results uploaded by means of CrowdLabs.
Zenodo Zenodo  is an open-source platform allowing users to publish their research results together with arbitrary digital artefacts. These artefacts can be source data, software, textual publications, or any other scientific resource that adds value for understanding or working with the release. Zenodo enforces no standardised formats, sizes, or licenses for the uploaded information. Each published entry is allocated a globally unique identifier, which can be used to reference scientific work released on Zenodo in subsequent works. Zenodo's restriction-free data-sharing practices allow for flexible application of the platform in Open Science and for third-party tool integration via an API; however, this leads to diversified and sometimes even unstructured representations. Current applications supporting the ML research process do not employ solutions like Zenodo sufficiently and predominantly rely on manual uploads.
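Third-party integration via Zenodo's REST API can be sketched as follows. The deposit endpoint below matches Zenodo's documented API, but the request is only constructed, never sent, and the token and metadata values are placeholders.

```python
import json
import urllib.request

# Documented Zenodo REST endpoint for creating depositions.
ZENODO_API = "https://zenodo.org/api/deposit/depositions"

def build_deposit_request(token: str, title: str, description: str) -> urllib.request.Request:
    """Construct (but do not send) a request creating a new deposition."""
    metadata = {
        "metadata": {
            "title": title,
            "description": description,
            "upload_type": "dataset",
        }
    }
    return urllib.request.Request(
        url=f"{ZENODO_API}?access_token={token}",
        data=json.dumps(metadata).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_deposit_request(
    "PLACEHOLDER_TOKEN", "ML experiment artefacts", "Raw results and code"
)
print(req.method, req.full_url)
```

A tracking tool could issue such a request at the end of an experiment run and attach the resulting identifier to its own records, replacing the manual uploads criticised above.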
6 Platform comparison
Regarding the challenges of structured science, a multitude of tools exists dealing with tasks like data persistence, model creation and execution, result and data visualisation, methodology description, and procedure recommendation. The respective platforms and kits can be highly specialised and then lack integration into the big picture. Tools like VisTrails allow visualisations of different experiments to be compared, while platforms like Sacred track algorithmic settings, can optimise hyper-parameters, or in some cases even generate experimental code from formal definitions. Other tools are only partially suited to the field of ML; Kepler, for example, allows descriptions of abstract workflows but lacks support for common ML tasks.
In order to give a more structured overview, we have aligned 11 platforms with the introduced Open Science Process Model for ML in Table 1 and with the core scientific principles outlined above in Table 2. Each platform is hereby assigned one of three levels. A checkmark (✓) denotes a positive rating, indicating good applicability and features for the related task or principle; a swung dash (˷) is to be considered neutral, while a cross (✗) denotes a lower rating. The assignment has been conducted based on information found in publications and online resources. It is designed to provide a first overview and is to some extent subjective and open to discussion. For a fully informed choice of platform, a more detailed comparison of the systems in question is advised. Given the complex nature of these alignments, the three-level ratings necessarily omit details and should only be taken as a starting point for one's own evaluation.
Our exploration showed that, as of today, no platform supports all requirements of our Open Science Process Model in full, but each platform already realises a subset of them.
While open source tools offer a wide range of functionalities, the current ecosystem lacks a comprehensive platform enabling the full stack of ML research, including core Open Science principles. We identified OpenML as a broad platform supporting a multitude of documentation, sharing, and analysis concepts. In comparison to other platforms and toolsets, however, OpenML tends to miss specialised features like tracking of single decisions in an ML algorithm or tool-supported discussion and visualisation. In our experience, tools like Sacred form a promising and flexible approach, yet lack support for the process as a whole.
Policies like the preregistration of studies can combat publication bias, as shown in the field of psychology. Such policies may, however, be hard to implement for ML. While some fields need long preparation times for experiments, a common ML experiment can be conducted in a short time frame; defining an adequate preregistration period might therefore be difficult. If the period is too short, an evaluation could happen before preregistration. Nevertheless, the algorithmic nature of ML experiments allows for a certain standardisation which may not be applicable to other fields.

In our point of view, establishing an open and transparent research process is essential in order to increase the efficiency of ML and DS research. Platforms and tools will be required to support this process, including the extension of the current paper-based publication system with the publication of structured results and details of the individual data-analytical process. No platform or tool currently supports the complete process, and further effort in developing such a tool-chain or platform is required for the future. Tools and platforms also need to consider legal constraints forming a barrier to transparency and open access. In general, an independent entity offering an evaluation and documentation infrastructure would be desirable. For the time being, open source software and standards like the discussed solutions need to be developed and incentivised or enforced by publishers. Further studies have to assess whether the current release process of scientific research in ML may hinder its progress. It remains to be discussed whether not only technical but also systemic issues contribute to the reproducibility crisis described in Chapter 1. Possible solutions may include initiatives to reduce the pressure to publish and to release reproducible, high-quality science.
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS’16, pages 308–318. ACM, New York, NY, USA, 2016.
Michele Alberti, Vinaychandran Pondenkandath, Marcel Würsch, Rolf Ingold, and Marcus Liwicki. DeepDIVA: A highly-functional python framework for reproducible experiments. CoRR, abs/1805.00329, 2018.
Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew B. Jones, Bertram Ludäscher, and Steve Mock. Kepler: an extensible system for design and execution of scientific workflows. In Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004, pages 423–424, 2004.
Timothy G Armstrong, Alistair Moffat, William Webber, and Justin Zobel. Improvements that don’t add up: ad-hoc retrieval results since 1998. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 601–610. ACM, 2009.
Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25, 2000.
Steven P. Callahan, Juliana Freire, Emanuele Santos, Carlos Eduardo Scheidegger, Cláudio T. Silva, and Huy T. Vo. VisTrails: visualization meets data management. In SIGMOD Conference, 2006.
Fernando Seabra Chirigati, Rémi Rampin, Dennis Shasha, and Juliana Freire. ReproZip: Computational reproducibility with ease. In SIGMOD Conference, 2016.
Graham Dove, Kim Halskov, Jodi Forlizzi, and John Zimmerman. UX design innovation: Challenges for working with machine learning as a design material. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI’17, pages 278–288. ACM, New York, NY, USA, 2017.
Benedikt Fecher and Sascha Friesike. Open science: one term, five schools of thought. In Opening science, pages 17–47. Springer, 2014.
Erin D Foster and Ariel Deardorff. Open science framework (OSF). Journal of the Medical Library Association: JMLA, 105(2):203, 2017.
Juliana Freire, Norbert Fuhr, and Andreas Rauber. Reproducibility of data-oriented experiments in e-science (Dagstuhl seminar 16041). Dagstuhl Reports, 6:108–159, 2016.
Tim Gollub, Benno Stein, Steven Burrows, and Dennis Hoppe. TIRA: Configuring, executing, and disseminating information retrieval experiments. In 2012 23rd International Workshop on Database and Expert Systems Applications, pages 151–155, 2012.
Klaus Greff, Aaron Klein, Martin Chovanec, Frank Hutter, and Jürgen Schmidhuber. The Sacred infrastructure for computational research. In Proceedings of the Python in Science Conferences-SciPy Conferences, 2017.
Robert Grossman, Simon Kasif, Reagan Moore, David Rocke, and Jeff Ullman. Data mining research: Opportunities and challenges. A report of three NSF workshops on mining large, massive, and distributed data, 1999.
Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Philip J. Guo and Dawson R. Engler. CDE: Using system call interposition to automatically create portable software packages. In USENIX Annual Technical Conference, 2011.
Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. CoRR, abs/1709.06560, 2017.
Yves Janin, Cédric Vincent, and Rémi Duraffort. CARE, the comprehensive archiver for reproducible execution. In TRUST@PLDI, 2014.
Brewster Kahle, Rick Prelinger, Mary E Jackson, Kevin W Boyack, Brian N Wylie, George S Davidson, Ian H Witten, David Bainbridge, Stefan J Boddie, William A Garrison, et al. Public access to digital material; a call to researchers: Digital libraries need collaboration across disciplines; report on the first joint conference on digital libraries. D-Lib Magazine, 7(10):n10, 2001.
Zachary C Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341, 2018.
Liz Lyon. Transparency: the emerging third dimension of open science and open data. Liber quarterly, 25(4), 2016.
Raúl Palma, Piotr Holubowicz, Óscar Corcho, José Manuél Gómez-Pérez, and Cezary Mazurek. ROHub—a digital library of research objects supporting scientists towards reproducible science. In SemWebEval@ESWC, 2014.
Ken Peffers, Tuure Tuunanen, Marcus A Rothenberger, and Samir Chatterjee. A design science research methodology for information systems research. Journal of management information systems, 24(3):45–77, 2007.
Isabella Peters, Peter Kraker, Elisabeth Lex, Christian Gumpenberger, and Juan Gorraiz. Zenodo in the spotlight of traditional and new metrics. In Front. Res. Metr. Anal., 2017.
Quan Pham, Tanu Malik, and Ian T. Foster. Using provenance for repeatability. In TaPP, 2013.
Vedran Sabol, Gerwald Tschinkel, Eduardo Veas, Patrick Hoefler, Belgin Mutlu, and Michael Granitzer. Discovery and visual analysis of linked data for humans. In International Semantic Web Conference, pages 309–324. Springer, 2014.
Erich Schubert and Michael Gertz. Numerically stable parallel computation of (co-)variance. In Proceedings of the 30th International Conference on Scientific and Statistical Database Management, page 10. ACM, 2018.
Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
Sören Sonnenburg, Mikio L Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, et al. The need for open source software in machine learning. Journal of Machine Learning Research, 8(Oct):2443–2466, 2007.
Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luís Torgo. OpenML: networked science in machine learning. SIGKDD Explorations, 15:49–60, 2013.
Kiri Wagstaff. Machine learning that matters. CoRR, abs/1206.4656, 2012.
Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific data, 3, 2016.
About the article
Thomas Weißgerber works as a research associate at the Chair of Distributed Information Systems at the University of Passau. In 2016 he obtained his M.Sc. at the University of Passau, after already contributing to a multitude of national and EU-funded projects. His scientific work has covered visualization techniques, similarity metrics, the semantic web, software engineering, privacy-preserving technologies, and machine learning.
Prof. Dr. Michael Granitzer holds the Chair of Data Science at the University of Passau. After his M.Sc. in 2004, he obtained a PhD degree, passed with distinction, in technical science in 2006. His research addresses topics in the fields of Applied Machine Learning, Deep Learning, Visual Analytics, Information Retrieval, Text Mining, and Social Information Systems. He has published over 190 mostly peer-reviewed publications, including journal publications, book chapters, and books in the above-mentioned fields. Publications are available for download under http://mgrani.github.io/.
Published Online: 2019-04-26
Published in Print: 2019-08-27