Big data and complex analysis workflows (pipelines) are common in data-driven science such as bioinformatics. A large number of computational tools are available for data analysis, and many workflow management systems have been developed to piece such tools together into analysis pipelines. For example, more than 50 computational tools for read mapping are available, representing a large amount of duplicated effort. Furthermore, it is unclear whether these tools are correct, and only a few have a user base large enough to have encountered and reported most of the potential problems. Bringing together many largely untested tools in a computational pipeline is bound to lead to unpredictable results. Yet, this is the current state. While data analysis is presently performed on personal computers, workstations, and clusters, development and analysis will increasingly shift to the cloud. None of the current workflow management systems is ready for this transition. This presents the opportunity to build a new system which overcomes the current duplication of effort, introduces proper testing, allows development and analysis in public and private clouds, and includes reporting features leading to interactive documents.
Today, bioinformatics supports most biological and medical research projects; in the following, bioinformatics examples stand in for data science as a whole. In its beginnings, bioinformatics was mostly concerned with sequence alignment, which remains an important task, but many other tasks have since emerged, ranging from statistical calculations to image and video analysis. Over the last decades, many bioinformatics tools have been developed. Sequence alignment, for example, can be performed with a myriad of tools such as FASTA and BLAST; in fact, at least 50 tools have been developed for this task alone. This duplication of effort is also seen in other areas of bioinformatics. In mass spectrometry (MS), for example, there are at least ten tools for de novo sequencing of MS/MS spectra and at least ten more for database search. It is impossible to stay on top of the most recent developments across a large number of tool categories. Collections such as JIB Tools try to organize tools into categories, but this is a manual and time-consuming task for the tool editors. DaTo offers a very comprehensive collection of tools and databases accessible via an online interface; its entries are discovered automatically but not manually annotated. The information in DaTo is therefore more comprehensive, but it does not provide a quality assessment of the tools and databases.
1.1 Quality of Computational Tools
The quality of tools in bioinformatics is often hard to assess because gold-standard datasets are not available or cannot be produced for a given problem. Even where good test data are available, they are often not used because data formats are not agreed upon, which complicates the testing of newly developed tools. Some projects such as OpenMS at least include unit tests for all their modules, but many bioinformatics tools do not include extensive testing. Instead, they may contain code smells, an indication that the software should be redesigned. Code smells come in many odours for different types of design flaws, such as Shotgun Surgery, which refers to the problem that making a single code change requires touching many parts of the project at once. The aforementioned smell, Dead Code, and Divergent Change seem to be common problems today, often arising from copy-pasting code that works, or appears to work. One reason for smelly bioinformatics code is that bioinformaticians wear two hats, one for information science and one for biology or related fields. This problem has long been identified, and in an attempt to overcome it, programming schools were initiated under the software carpentry umbrella.
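To make the Dead Code smell concrete, the following hypothetical Python snippet (invented for illustration, not taken from any named tool) shows a guard that survived a copy-paste refactor but can never execute:

```python
# Hypothetical example of the Dead Code smell: a branch kept "just in case"
# after a copy-paste refactor, although it can never run.

def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = sum(1 for base in seq if base in "GC")
    if gc < 0:  # dead code: a count of matches can never be negative
        raise ValueError("negative count")
    return gc / len(seq)

print(gc_content("GATTACA"))  # 2 of 7 bases are G or C
```

Static analysis tools flag such branches automatically, but they are only useful if running them becomes a routine part of tool development.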
1.2 Software Carpentry
Initiated out of frustration in 1998, software carpentry has grown into an organization reaching thousands of computational scientists. Its overarching aim is to teach basic lab skills for research computing. In 2018, Software Carpentry and Data Carpentry merged their efforts (The Carpentries: https://carpentries.org/). In their own words: “The Carpentries seek to build and grow communities of practice around computational skills development for researchers”. Today, about 500 accredited instructors each teach roughly one course on average, reaching a total of around 16,000 participants. Many skills taught as part of computer science curricula are never formally taught to computational scientists of other disciplines who simply happen to need programming skills for their daily tasks. According to Greg Wilson, a seminal figure in software carpentry, their courses double participants’ computational skills and make them more effective programmers in practice.
1.3 Programming Languages for Scientific Computing
1.4 Workflow Management Systems
While using a single tool for an analysis, e.g. aligning a novel sequence to known sequences, is a relatively small task (ignoring ID conversions and the large number of available databases), the complexity of data analysis is ever increasing. At the same time, there is a call for reproducibility of data analysis. Workflow management systems (WMS) ensure the reproducibility of complex data analysis tasks. As with other bioinformatics tools, a large variety of WMS are available. WMS used in bioinformatics include, for example, Taverna, Galaxy, and KNIME, but many others exist. Some WMS include collaborative workflow development with versioning (e.g. KNIME). Most WMS allow the incorporation of new tools; for some this is relatively simple (Galaxy), for others it may be more involved (KNIME). While workflows themselves can be versioned, the tools incorporated in a workflow generally are not. This can, however, be achieved with Cuneiform, which can execute a workflow while loading specific tool revisions from git repositories. Unit testing and integration testing are generally not part of WMS, although building computational data analysis pipelines needs proper testing just like building any other software artifact. There are attempts to make Galaxy workflows testable. Galaxy and other WMS run on a server and can be accessed via the internet. However, none of the WMS used in bioinformatics leverage the computational power of both server and client, or make online and offline development seamless. The internet of things has, among other developments, seen the rise of WMS that enable the analysis of sensor data.
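As a minimal sketch of what unit-testable workflow steps could look like, assuming nothing beyond standard Python (the step names `trim` and `dedupe` are invented for illustration): when each step is a pure function, it can be tested in isolation and then chained into a pipeline.

```python
# Sketch: workflow steps as pure functions, testable alone and composed.
from functools import reduce

def trim(reads):
    """Step 1: drop short reads (stand-in for quality filtering)."""
    return [r for r in reads if len(r) >= 4]

def dedupe(reads):
    """Step 2: remove exact duplicates while keeping input order."""
    return list(dict.fromkeys(reads))

def run_pipeline(reads, steps=(trim, dedupe)):
    """Apply each step in order to the data."""
    return reduce(lambda data, step: step(data), steps, reads)

# Each step has its own unit test, and the composed workflow an integration test:
assert trim(["ACGT", "AC"]) == ["ACGT"]
assert dedupe(["A", "A", "B"]) == ["A", "B"]
assert run_pipeline(["ACGT", "AC", "ACGT"]) == ["ACGT"]
```

A WMS that enforced this discipline would gain workflow-level integration tests for free, since a workflow is just a composition of already-tested steps.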
One challenge in the future of scientific computing is the transfer from workstations and desktop computers to laptops or even smaller units. Additionally, the transition from local programs to online applications will change the way scientific computing is performed today. At the same time, many programs used in scientific computing are not comprehensively tested and exist in many alternative forms, e.g. developed in different programming languages. In the following, I suggest a system that overcomes current issues in scientific computing and streamlines the development of new tools and workflows. This is a call for a community effort in science, but it also invites the commercial sector to collaborate.
The envisioned system depends on a central application server, or several replicas thereof, to provide access to the IoS for development and use. For closed installations, for example in companies, the application can be installed on a local server behind a company firewall. The application server holds the main application with several modes of accessing the system: (1) as administrator, (2) as reviewer, (3) as developer, (4) as tester, and (5) as workflow developer or user. Users and workflow developers only have access to successfully reviewed tools, whereas the other access modes apply progressively less restrictive access models. While access at level 5 is unrestricted, the more privileged levels require accreditation, for example via providers such as ORCID. The application server holds no data for workflows, which should be linked via web-accessible resources, and it does not perform computations for user workflows. Instead, workflows and tools are executed elsewhere: on a trusted resource (e.g. a grid), on connected users' machines that grant access to their web browsers for computation, or on public or restricted dedicated computation nodes for the IoS. In summary, the system consists of a replicated server for the main application and tools, name servers for user authentication, and different options for assigning computational resources for data access and computation. The IoS is thus a cloud-based application which also facilitates access to distributed computing with the reviewed IoS tools on various cloud services. It thereby resembles software as a service for the development of the IoS itself, while workflow development resembles a platform-as-a-service structure.
It is especially important to prevent errors from propagating into critical data analysis workflows, such as those used for clinical decision-making. Therefore, development and production will be separated, and only workflows built solely from production-level tools will be executable in the community workflow management system. Tools are promoted from the development system when they (a) pass synthetic and real-world test scenarios and (b) pass the scrutiny of a self-/auto-assembled community review committee (Figure 2). Incentives for performing reviews include being able, in turn, to use the reviewed tools, extend them, and/or receive reviews for one's own tools. Additionally, efforts such as developing test suites or providing test data will be incentivised by associating them with citable digital object identifiers (DOIs).
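A synthetic test scenario in the sense of criterion (a) could, for instance, generate random inputs and check an invariant the tool must satisfy. The following Python sketch (a hypothetical example, not part of any existing IoS test suite) checks that reverse-complementing a DNA sequence twice returns the original:

```python
# Sketch of a synthetic test scenario: random inputs plus an invariant check.
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

random.seed(42)  # fixed seed so the scenario is reproducible
for _ in range(100):
    seq = "".join(random.choice("ACGT") for _ in range(50))
    # Invariant: the reverse complement is its own inverse.
    assert reverse_complement(reverse_complement(seq)) == seq
print("synthetic round-trip scenario passed")
```

Such invariant-based tests complement real-world test data because they can be generated at scale without manual annotation.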
Successfully reviewed tools (Figure 2) are part of the production level and can then be used to build data analytics workflows (Figure 3). Self-made or unreviewed tools can be used as well, for example by their developers, but should not be acceptable for publication or for any critical data analysis. This model does not hinder new developments but will speed them up. For example, a doctoral student does not need to develop handlers or connectors for commonly used data types but can simply build a workflow using existing tools. Adding their own algorithm as an extension of existing tools, or as a novel tool, leads to novel workflows which can be used directly, or encapsulated as modules for use in production after passing due review. Modules are also very useful for automatic algorithm selection. For example, many algorithms have been developed for exact pattern matching, and their performance varies with the input, such that the most effective algorithm can be selected automatically. Such a scenario can be encapsulated into a module, achieving a huge reduction in complexity by shielding users of the module from the decision of which particular algorithm to use in a specific situation. This also ensures that everyone contributing to a particular problem with different algorithms can re-use existing tests and directly benchmark against all other reviewed solutions using a comprehensive search space, thereby adding to the problem's solution without increasing apparent complexity. Furthermore, all meaningful portions of code are citable via DOIs, making the code FAIR, similar to already established initiatives for FAIR data.
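An algorithm-selection module of this kind can be sketched in a few lines of Python. The selection rule below (switching on pattern length) is a made-up heuristic for illustration only; a production module would choose based on measured benchmarks over the reviewed implementations.

```python
# Sketch of an algorithm-selection module for exact pattern matching.
# Users call find_all(); the module picks an implementation internally.

def naive_search(text, pattern):
    """Check every position; simple, fine for short patterns."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text[i:i + len(pattern)] == pattern]

def builtin_search(text, pattern):
    """Stand-in for a faster algorithm, using str.find repeatedly."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)  # +1 keeps overlapping hits
    return hits

def find_all(text, pattern):
    """Facade: select an algorithm based on the input's properties."""
    algorithm = naive_search if len(pattern) < 4 else builtin_search
    return algorithm(text, pattern)

assert find_all("GATTACA", "TA") == [3]
```

Because both implementations share one interface and one test suite, a new pattern-matching algorithm only needs to plug into `find_all` and pass the existing tests to become a candidate for selection.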
Reporting is an important part of a computational data analytics workflow. In fact, since the workflow already defines all inputs, data transformations, and outputs, reporting is fully traceable from any report item to the underlying raw data. The aim is to integrate workflow outputs into documents which can be collaboratively developed online. The outputs will allow direct access to the reasoning down to the raw data from within the document. This type of interactive document completely encapsulates the intent and the approach taken and, together with the embedded IoS workflows, ensures reproducibility.
For future development of the internet of science, a public repository has been set up: https://bitbucket.org/allmer/ios/src/master/. Developers of the IoS system need approval to gain write access, but read access is public. The IoS will be presented several times in 2019, and towards the beginning of 2020 a conference is planned to bring together interested parties and to grow the community.
The main aim of the internet of science is to re-establish collaboration as the first principle of science. The IoS enables collaborative work on developing and testing software, developing and testing data analysis workflows, and joint reporting. It binds together all stakeholders while enabling the tracking of individual and joint efforts via DOIs. The IoS embraces FAIR code and appreciates FAIR data. Documents developed using the IoS provide access to the data, to the IoS workflow for data analysis and in turn to all code involved, as well as to the document itself, thus making the information FAIR as well.
Conflict of interest statement: Authors state no conflict of interest. All authors have read the journal’s Publication ethics and publication malpractice statement available at the journal’s website and hereby confirm that they comply with all its parts applicable to the present scientific work.
 Wikipedia. List of sequence alignment software. 2019. https://en.wikipedia.org/wiki/List_of_sequence_alignment_software.
 Verheggen K, Martens L, Berven FS, Barsnes H, Vaudel M. Database search engines: paradigms, challenges and solutions. Adv Exp Med Biol 2016;919:147–56. doi:10.1007/978-3-319-41448-5_6.
 JIB Tools. J Integr Bioinform. 2019. https://agbi.techfak.uni-bielefeld.de/JIBtools/.
 Allmer J. A call for benchmark data in mass spectrometry-based proteomics. J Integr OMICS 2012;2:1–5.
 Sturm M, Bertsch A, Gröpl C, Hildebrandt A, Hussong R, Lange E, et al. OpenMS – an open-source software framework for mass spectrometry. BMC Bioinformatics 2008;9:163. doi:10.1186/1471-2105-9-163.
 Prlić A, Yates A, Bliven SE, Rose PW, Jacobsen J, Troshin PV, et al. BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics 2012;28:2693–5. doi:10.1093/bioinformatics/bts494.
 Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, et al. The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res 2013;41:W557–61. doi:10.1093/nar/gkt328.
 Afgan E, Baker D, Batut B, van den Beek M, Bouvier D, Čech M, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res 2018;46:W537–44. doi:10.1093/nar/gky379.
 Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, et al. KNIME: The Konstanz Information Miner. In: Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R, eds. Data analysis, machine learning and applications. Berlin, Heidelberg: Springer, 2008:319–26. doi:10.1007/978-3-540-78246-9_38.
 Reiser L, Harper L, Freeling M, Han B, Luan S. FAIR: a call to make published data more findable, accessible, interoperable, and reusable. Mol Plant 2018;11:1105–8. doi:10.1016/j.molp.2018.07.005.
 Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004;32:D258–61. doi:10.1093/nar/gkh036.
© 2019, Jens Allmer, published by Walter de Gruyter GmbH, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 Public License.