Dramatic reductions in sequencing costs and improvements in sequencing technologies have increased the throughput of sequencing data and, over the last years, led to a burst of dual transcriptomic and metagenomic studies completed in much shorter time. To understand these datasets and to detect key species in them, detailed insights into the data are necessary. Researchers therefore need access to intuitive visualization tools to obtain a global overview of the data. Commonly used tools include Krona, MEta Genome ANalyzer (MEGAN), iTOL and VAMPS. These tools visualize the taxonomic composition of a dataset by exploring abundances at the species level or at a broader taxonomic rank, whereas iTOL uses a phylogenetic clustering approach. For single organisms, the Integrative Genomics Viewer (IGV), Circos and Tablet are commonly used to visualize integrated genomic datasets.
In contrast to the pie chart visualizations offered by Krona and the bar chart plots integrated in MEGAN, Sankey diagrams are a good alternative for visualizing gene expression data or microbial community compositions over time. Sankey diagrams are flow diagrams in which the arrow width is proportional to the quantity (e.g. gene expression), depicting changes over time or the hierarchy between nodes. These diagrams can indicate the increase or decrease of data elements across two or more time points. Sankey diagrams are also commonly used in other research areas to highlight changes over time, e.g. in eye dynamics, medical records, energy flows in cities, energy efficiency and voter transition.
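To make the proportionality between quantity and band width concrete, the following is a minimal sketch and not BioSankey's actual code. The function name `band_widths`, the `plot_height` parameter and the example values are assumptions introduced for illustration only. A single scale factor is applied to all time points, so widths stay comparable over time.

```python
def band_widths(abundance, plot_height=100.0):
    """Convert quantities into band widths for a time-series Sankey plot.

    abundance: dict mapping an element name to a list of values,
               one value per time point (hypothetical layout).
    One global scale factor is used so that the time point with the
    largest total fills `plot_height`; widths remain proportional to
    the underlying quantities at every time point.
    """
    n_points = len(next(iter(abundance.values())))
    totals = [sum(values[t] for values in abundance.values())
              for t in range(n_points)]
    largest = max(totals)
    scale = plot_height / largest if largest else 0.0
    return {name: [value * scale for value in values]
            for name, values in abundance.items()}


# Illustrative values only, not taken from any dataset.
example = {"Streptococcus": [30, 10], "Neisseria": [10, 10]}
widths = band_widths(example)
```

Scaling each time point separately would instead show relative composition; which of the two is appropriate depends on whether the input is already normalized.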
As the costs of sequencing decrease, more time-series data are produced, requiring intuitive visualization methods that also support researchers without direct knowledge of programming languages. For the analysis of genes, Sankey plots are well suited to detecting candidate genes with a similar expression profile. Especially when interlinked with functional descriptions that place the genes in their biological context, such as InterPro, Gene Ontology (GO) categories, Clusters of Orthologous Groups (COG), protein families (Pfam) or KEGG, these visualizations provide the basis for a global overview.
BioSankey uses Sankey plots to analyze microbial communities at the species level and at broader taxonomic levels, and to inspect gene expression across different time points. The user can analyze lists of differentially expressed genes (DEGs) by inspecting expression transitions over time, and can explore them functionally through customized queries on the data. Additionally, the tool can easily be embedded into an existing analysis workflow. With BioSankey, we provide a tool for functional queries to search for and visualize key genes, with an additional export function that allows the plots to be integrated into publications.
2 Materials and Methods
2.1 Import of Data into BioSankey and Generation of the Project-Specific Website
The absolute or relative abundance of an element type (e.g. normalized microbial species abundance or gene expression) is provided by the user as a comma-separated file. Each row represents a microbial species or gene, identified in the first column; the second column contains the description of the gene or operational taxonomic unit (OTU), followed by the normalized read counts per condition or, when genes are analyzed, the expression values. The identifiers can be used to query the dataset in the website. Optionally, the user can provide time-series data on the fluctuation of up- and down-regulated DEGs (e.g. first versus second time point), which we describe in use case 4. The whole functionality of BioSankey is summarized in Figure 1, starting from the specification of the possible input files, through the integration of the information into a web page, to the generation of the web page. We have created a graphical user interface (GUI) that guides the user through the configuration process and automatically executes the Python script that generates the project-specific web page. The Python scripts were developed under Python 3.6, require no additional packages and should work on both Linux and Windows.
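The input layout described above can be read with the Python standard library alone. The following is a minimal sketch under stated assumptions and not the parser shipped with BioSankey: it assumes a header row, an identifier in the first column, a description in the second and numeric values in all remaining columns. The function name `read_abundance_table` and the sample rows are hypothetical.

```python
import csv
import io


def read_abundance_table(handle):
    """Parse an abundance or expression table (assumed layout).

    Assumed columns: identifier, description, then one numeric
    column per condition. The first row is assumed to be a header.
    """
    reader = csv.reader(handle)
    header = next(reader)
    conditions = header[2:]
    records = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        identifier, description = row[0], row[1]
        values = [float(cell) for cell in row[2:]]
        records[identifier] = {
            "description": description,
            "values": dict(zip(conditions, values)),
        }
    return conditions, records


# Hypothetical sample input; a real file would be opened with open(path, newline="").
sample = io.StringIO(
    "id,description,t1,t2\n"
    "gene1,kinase,10.5,3\n"
    "gene2,transporter,0,7.25\n"
)
conditions, records = read_abundance_table(sample)
```

Keeping the values keyed by condition name makes later queries by time point straightforward.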
2.2 Visualization of Elements in the HTML Site
The filter criteria let the user set how many genes are shown. Genes matching a search are displayed as Sankey diagrams; if a search returns more genes than the defined number, the genes to display must be chosen from a selection box. The search panel offers three visualization modes. In the “METAGENOME” mode, microbial communities are visualized from the highest to the lowest taxonomic unit, a feature inspired by tools such as Krona and MEGAN, so that the abundance of each bacterium can be analyzed and followed over time. If the corresponding data are provided, the user can select “DEG categories”, which gives an overview of the transitions of up- and down-regulated genes between consecutive time points. In the “GENES” mode, all genes are visualized and can be selected.
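Displaying a community from the highest to the lowest taxonomic unit requires abundances to be summed at every rank. The sketch below illustrates one way to do this; it is not BioSankey's implementation. It assumes lineages written as semicolon-separated strings, and the function name `rollup` and the example counts are hypothetical.

```python
from collections import defaultdict


def rollup(lineage_counts, sep=";"):
    """Sum abundances at every taxonomic rank of each lineage.

    lineage_counts: dict mapping a lineage string such as
    "Bacteria;Firmicutes;Streptococcus" (assumed format) to a count.
    Returns a dict keyed by the tuple of ranks down to each level.
    """
    totals = defaultdict(float)
    for lineage, count in lineage_counts.items():
        ranks = [rank.strip() for rank in lineage.split(sep) if rank.strip()]
        for depth in range(1, len(ranks) + 1):
            totals[tuple(ranks[:depth])] += count
    return dict(totals)


# Illustrative counts only.
counts = {
    "Bacteria;Firmicutes;Streptococcus": 30,
    "Bacteria;Firmicutes;Veillonella": 10,
    "Bacteria;Proteobacteria;Neisseria": 20,
}
totals = rollup(counts)
```

Each key of `totals` corresponds to one node of the hierarchy, and the value of a parent equals the sum of its children.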
3 Results and Discussion
3.1 General Use of BioSankey
3.2 Comparability to Other Tools
We compared BioSankey to other tools that visualize the abundances of species or taxa, namely Krona and iTOL. Table 1 gives an overview of the functionality of BioSankey in comparison to these two tools. While BioSankey can visualize both taxonomic and time-series data, Krona and iTOL do not support time-series data but offer a broad range of export functionalities. An advantage of BioSankey is the visualization of the expression of single genes or selected species, which is so far not supported by the other tools.
Table 1. Comparison of the three tools BioSankey, Krona and iTOL.
|Feature||BioSankey||Krona||iTOL|
|Time series data visualization||Yes||No||No|
|Various export possibilities||No||Yes||Yes|
|Highlighting of selected genes (e.g. DEGs)||No||Yes||Yes|
|Search function to find genes/taxa of interest||Yes||No||Yes|
3.3 Microbial Community Analysis
In metagenomic studies, reads are often assembled and scaffolded with tools such as SSPACE and then enter a binning step with tools such as MaxBin or CONCOCT, which group contigs into bins based on their tetranucleotide frequencies and additional intrinsic features. As an alternative, and more often, the 16S rRNA gene is sequenced to assess the abundance of the species in a metagenomic dataset. Sequences are then clustered at 97% sequence identity to form OTUs, which can be analyzed with resources such as the SILVA server after processing with Mothur and QIIME. To demonstrate BioSankey for analyzing microbial communities, we used data generated with QIIME from Caporaso et al., comprising time series of microbial communities from different human tissues. The goal of that study was to gain insights into the variability in these tissues and to define potential core microbiomes. The authors use line and pie charts, scatter plots, principal component analysis (PCA) and area charts, the latter two especially for showing variation over time; PCAs over time were visualized in a video. To this extensive collection of visualizations we added two alternatives. First, we selected the tongue tissue, extracted all information at genus level and used BioSankey to generate a project-specific HTML site. Of 373 genera, 250 were supported by at least one read and were used for BioSankey (Figure 2A). For comparison, we also made a Krona plot of the same data (Figure 2B). With Sankey diagrams we can additionally visualize the changes of the microbiome over time in one diagram, while Krona plots have the advantage that broader or very detailed taxonomic hierarchies can be depicted.
Further, we used the first six time points at genus level of the tongue microbial inhabitants for BioSankey (Figure 3). This is a partial replacement of the genus-level figure in Additional File 10 of Caporaso et al. The Sankey plot contains more information (the labeling) and is more appealing to the reader, with the drawback that showing all 396 time points at once is infeasible. In this case, BioSankey plots may be the right choice for zooming in or for showing a selection of about 20 time points, or more depending on the screen size and resolution.
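The two preparation steps described above, keeping only genera supported by at least one read and restricting the plot to the first time points, can be sketched as follows. This is an illustration under assumptions, not the procedure used to produce Figures 2 and 3: the function name `select_genera`, its parameters and the example table are hypothetical, and support is evaluated over all time points before the series is truncated.

```python
def select_genera(table, n_time_points=None, min_reads=1):
    """Keep supported genera and optionally truncate the time series.

    table: dict mapping a genus name to a list of read counts,
           one per time point (hypothetical layout).
    A genus is kept when its total over ALL time points reaches
    `min_reads`; the series is cut to `n_time_points` afterwards.
    """
    selected = {}
    for genus, counts in table.items():
        if sum(counts) < min_reads:
            continue
        if n_time_points is None:
            selected[genus] = list(counts)
        else:
            selected[genus] = list(counts[:n_time_points])
    return selected


# Illustrative counts only.
table = {"A": [0, 0, 5], "B": [0, 0, 0], "C": [2, 1, 0]}
selected = select_genera(table, n_time_points=2)
```

Evaluating support after truncation instead would drop genera that only appear at later time points; which behaviour is wanted depends on the question being asked.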
3.4 Differential Gene Expression Visualizations Embedded into BioSankey
To demonstrate BioSankey, we used published gene expression time-series data describing the effects of camptothecin in U87-MG cell lines, integrating up- and down-regulated genes over time. Camptothecin is a drug that specifically targets topoisomerase I. In that study, the authors showed the effects of camptothecin on two glioblastoma cell lines (U87-MG and DBTRG-05), which is important for assessing the use of this drug against malignant gliomas. With the gene expression data in hand, the authors could infer the affected pathways and assess changes over time by considering expression data at six time points (2, 6, 16, 24, 48 and 72 h). Using BioSankey, we show that Sankey plots can visualize the reported ∼80% of down-regulated genes in U87-MG in a time-resolved manner (Figure 4).
As a general overview (mode: “DEG”), BioSankey provides a feature to highlight the number of differentially expressed genes, as shown in Figure 4. The genes at each time point are filtered for up- and down-regulated expression (in the example, with a minimal fold-change of 2). Figure 4 shows all transitions of genes between these states (up-, down- or low-regulated) at each time point.
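The counting behind such an overview can be sketched in a few lines. The example below is an assumption-laden illustration rather than BioSankey's code: it assumes fold-changes given as positive ratios, treats values of at least 2 as up-regulated and values of at most 1/2 as down-regulated, and labels everything else "none". The function names `classify` and `transition_counts` and the example genes are hypothetical.

```python
from collections import Counter


def classify(fold_change, threshold=2.0):
    """Assign a state to one fold-change ratio (assumed to be > 0)."""
    if fold_change >= threshold:
        return "up"
    if fold_change <= 1.0 / threshold:
        return "down"
    return "none"


def transition_counts(fold_changes, threshold=2.0):
    """Count state transitions between consecutive time points.

    fold_changes: dict mapping a gene to a list of fold-change
                  ratios, one per time point (hypothetical layout).
    Returns one Counter per pair of consecutive time points, keyed
    by (state at t, state at t + 1).
    """
    n_points = len(next(iter(fold_changes.values())))
    counts = [Counter() for _ in range(n_points - 1)]
    for values in fold_changes.values():
        states = [classify(value, threshold) for value in values]
        for t in range(n_points - 1):
            counts[t][(states[t], states[t + 1])] += 1
    return counts


# Illustrative values only.
example = {
    "geneA": [2.5, 3.0, 1.0],
    "geneB": [0.4, 0.3, 0.2],
    "geneC": [1.1, 2.2, 2.4],
}
counts = transition_counts(example)
```

Each counter entry corresponds to one band between two neighbouring columns of the overview plot, with the count determining the band width.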
For a detailed analysis, the genes of a transition (e.g. all genes up-regulated at 2 h and at 6 h) can be selected and visualized separately. To obtain the overview visualization, the user must provide lists of DEGs for each time point in a directory. From the visualization we can observe that only a fraction of the genes (64 of 408) is already up-regulated in the first hours and that only 41 of them are differentially expressed at the later time points. Furthermore, the user can select, at a given time point, the genes that are up-regulated, down-regulated or not differentially expressed, extract them and visualize their expression to find a candidate gene or to obtain a general overview of the functionality of these genes.
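Selecting the genes of one transition amounts to intersecting the DEG lists of the time points involved. The following sketch assumes the per-time-point lists have already been loaded into sets; the function name `genes_in_transition`, the data layout and the gene names are hypothetical and do not reflect BioSankey's file format.

```python
def genes_in_transition(deg_sets, path):
    """Return the genes that follow a given path of states.

    deg_sets: dict mapping a time point label to a dict with the
              keys "up" and "down", each holding a set of gene names
              (hypothetical layout).
    path:     list of (time point, state) pairs, e.g.
              [("2h", "up"), ("6h", "up")].
    """
    selected = None
    for time_point, state in path:
        genes = deg_sets[time_point][state]
        selected = set(genes) if selected is None else selected & genes
    return sorted(selected or [])


# Illustrative gene names only.
deg_sets = {
    "2h": {"up": {"g1", "g2", "g3"}, "down": {"g4"}},
    "6h": {"up": {"g2", "g3", "g5"}, "down": {"g1"}},
}
up_up = genes_in_transition(deg_sets, [("2h", "up"), ("6h", "up")])
up_down = genes_in_transition(deg_sets, [("2h", "up"), ("6h", "down")])
```

The resulting gene list can then be passed on to a plot of the individual expression profiles.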
4 Conclusion
We have established BioSankey as a tool that offers an alternative way to analyze gene expression and the abundances in metagenomic datasets over time, using interactive Sankey diagrams, functional enrichment analysis and overview panels without requiring web servers or databases. We demonstrate how the data can be viewed interactively for an efficient analysis. BioSankey is a valuable tool for gaining insight into the complexity of different datasets, from a high-level view of gene numbers down to individual genes and, in the case of metagenomics, from a high taxonomic level down to strain level. The tool is useful for researchers who want to analyze the taxonomic composition of bacterial species in metagenomes, e.g. by also selecting broader taxonomic categories. As an additional feature, data exchange with collaborators is easily accomplished. For dual RNA-seq experiments, BioSankey might be especially powerful when hosts and microbes are compared to each other or when interesting candidate genes are to be detected efficiently. To this end, we have integrated various search criteria, which are based on domain functionalities or gene description information.
5 Material Availability
The software is available at the GitHub repository https://github.com/nthomasCUBE/BioSankey.
References
Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12:385.
Letunic I, Bork P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 2016;44:W242–5.
Huse SM, Mark Welch DB, Voorhis A, Shipunova A, Morrison HG, Eren AM, et al. VAMPS: a website for visualization and analysis of microbial population structures. BMC Bioinformatics. 2014;15:41.
Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat Biotechnol. 2011;29:24–6.
Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, et al. Circos: an information aesthetic for comparative genomics. Genome Res. 2009;19:1639–45.
Milne I, Bayer M, Stephen G, Cardle L, Marshall D. Tablet: visualizing next-generation sequence assemblies and mappings. Methods Mol Biol. 2016;1374:253–68.
Burch M, Kull A, Weiskopf D. AOI rivers for visualizing dynamic eye gaze frequencies. Computer Graphics Forum. 2013;32:281–90.
Huang C-W, Lu R, Iqbal U, Lin SH, Nguyen PA, Yang HC, et al. A richly interactive exploratory data analysis and visualization tool using electronic medical records. BMC Med Inform Decis Mak. 2015;15:92.
Chen S, Chen B. Coupling of carbon and energy flows in cities: a meta-analysis and nexus modelling. Appl Energy. 2017;194:774–83.
Dietmair A, Verl A. A generic energy consumption model for decision making and energy efficiency optimisation in manufacturing. Int J Sustain Eng. 2009;2:123–33.
Fieldhouse EA, Prosser C. When attitudes and behaviour collide: how the Scottish independence referendum cost labour. SSRN Electronic J. 2016. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2770996.
Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, et al. InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform. 2002;3:225–35.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–9.
Tatusov RL, Galperin MY, Natale DA, Koonin EV. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000;28:33–6.
Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42:D222–30.
Icay K. Director: A dynamic visualization tool of multi-level data. R package version 1.4.0. 2017.
Bojanowski M. Creating Alluvial Diagrams.
Weiner J. riverplot: Sankey or Ribbon Plots. 2017.
Sievert C, Parmer C, Hocking T, Chamberlain S, Ram K, Corvellec M, et al. plotly: create interactive web graphics via ‘plotly.js’. R package version. 2016;3.
Mauri M, Elli T, Mauri M, Uboldi G, Azzi M. RAWGraphs: a visualisation platform to create open outputs. In: Proceedings of the 12th Biannual Conference on Italian SIGCHI Chapter. ACM; 2017.
Bogart S. SankeyMATIC 2018.
Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011;27:578–9.
Wu YW, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32:605–7.
Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11:1144–6.
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013;41:D590–6.
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–41.
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7:335–6.
Caporaso JG, Lauber CL, Costello EK, Berg-Lyons D, Gonzalez A, Stombaugh J, et al. Moving pictures of the human microbiome. Genome Biol. 2011;12:R50.
Morandi E, Severini C, Quercioli D, D’Ario G, Perdichizzi S, Capri M, et al. Gene expression time-series analysis of camptothecin effects in U87-MG and DBTRG-05 glioblastoma cell lines. Mol Cancer. 2008;7:66.