BioSankey: Visualizing microbial communities and gene expression data over time

Metagenomics, RNA-seq, whole genome sequencing (WGS) and other next-generation sequencing techniques provide quantitative measurements for single strains and genes over time. To obtain a global overview of an experiment and to explore the full potential of a given dataset, intuitive and interactive visualization tools are needed. We therefore established BioSankey, which visualizes microbial species in microbiome studies and gene expression over time as Sankey diagrams. These diagrams are embedded into a project-specific HTML page that contains all information provided during the installation process. BioSankey can be easily applied to analyse bacterial communities in time-series datasets. Furthermore, it can be used to analyse the fluctuations of differentially expressed genes (DEGs). The resulting HTML page depends only on JavaScript, so users can search for species or genes of interest and exchange results with collaboration partners without requiring a web server or a connection to a database. BioSankey thus visualizes different data elements from single and dual RNA-seq datasets as well as from metagenome studies.


Introduction
Dramatic reductions in sequencing costs and improvements in sequencing technologies have led to a higher throughput of sequence material in much shorter time and to a burst of dual transcriptome experiments and metagenome studies in recent years. To understand these datasets and to find key genes in them, detailed insight into the data is necessary. Researchers therefore need access to intuitive visualization tools to obtain a global overview of the data. As sequencing costs decrease, more time-series data are produced, which require intuitive visualization methods, especially for researchers who have no direct knowledge of or access to programming languages. For the analysis of genes, Sankey plots are a well-suited visualization method to detect candidate genes with a similar expression profile.
These visualizations provide the basis for a global overview especially when they are interlinked with functional descriptions that place the genes in their biological context, such as InterPro (Mulder et al. 2002), Gene Ontology categories (GO; Ashburner et al. 2000), Clusters of Orthologous Groups (COG; Tatusov et al. 2000), protein families (Pfam; Finn et al. 2014) or KEGG (Kanehisa & Goto 2000). In our tool, BioSankey, we use Sankey plots to analyze microbial communities at both the species level and higher taxonomic levels, and to inspect gene expression across different time points. Furthermore, lists of differentially expressed genes (DEGs) can be analyzed by inspecting the expression transitions over time, and also functionally through customized queries on the data. Additionally, the tool can be easily embedded into an existing analysis workflow. With BioSankey, we provide a tool for functional queries to search for and visualize key genes, with an additional export function that allows plots to be integrated into publications.

Import of data into BioSankey and generation of the HTML site
The absolute or relative abundance of an element type (e.g. abundance per population or per gene) is provided by the user as a text file or Microsoft Excel file in which each row represents a data element and each column a time point. An optional second column can hold the description of the gene or OTU. These entries can then be used to query the dataset. Time-series data on the fluctuation of up- and down-regulated DEGs can also be provided as multiple files covering the pairwise comparisons between consecutive time points (e.g. first versus second time point), as described in Use case 1. The full functionality of BioSankey is summarized in Figure 1, starting from the specification of the possible input files, through the integration of the information into an HTML site with the help of a customized Python script, to the generation of the HTML site and PDF exports.
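The input layout described above can be illustrated with a minimal sketch. It assumes a tab-delimited text file with a header row, the identifier in the first column, the optional description in the second column and one numeric column per time point. The function name read_abundance_table and the example data are hypothetical and are not taken from the BioSankey code.

```python
import csv
import io


def read_abundance_table(handle, delimiter="\t"):
    """Parse a table with one row per element and one column per time point.

    Assumed layout (not confirmed by the source): column 1 = identifier,
    column 2 = optional description, remaining columns = numeric values.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    time_points = header[2:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        identifier, description = row[0], row[1]
        values = [float(v) for v in row[2:]]
        table[identifier] = {
            "description": description,
            "values": dict(zip(time_points, values)),
        }
    return time_points, table


# Hypothetical example data: two genes measured at three time points.
example = (
    "id\tdescription\t2h\t6h\t16h\n"
    "geneA\tkinase\t10\t25.5\t3\n"
    "geneB\t\t0\t1\t2\n"
)
time_points, table = read_abundance_table(io.StringIO(example))
```

An empty second column is kept as an empty description, which mirrors the optional nature of that column.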
We use one central script to extract the input data, i.e. the abundance data, the functional annotations and, optionally, the DEG lists, and to generate project-specific HTML pages containing Sankey diagrams based on the Google API. The HTML page includes features to query genes or bacterial species if these data are provided. Because the page relies on JavaScript, visualizations are generated directly without requiring a web server in the background. Each generated plot can then be exported as a PDF, either silently at the command line or after interactive adjustment.
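How a script might write such a self-contained page can be sketched as follows. This is not the BioSankey implementation: the helper names consecutive_links and render_page, the template, and the choice to weight each link by the value at the later time point are assumptions made for illustration. The sketch uses the Sankey chart of Google Charts, whose loader script is fetched from Google's servers when the page is opened.

```python
import json


def consecutive_links(name, values_by_time):
    """Turn one time series into Sankey rows linking consecutive time points.

    Assumption: the link weight is the value at the later time point.
    """
    points = list(values_by_time.items())
    rows = []
    for (t_prev, _), (t_next, v_next) in zip(points, points[1:]):
        rows.append([f"{name} ({t_prev})", f"{name} ({t_next})", v_next])
    return rows


# Static page; __ROWS__ is replaced by the JSON-encoded link rows.
HTML_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<script src="https://www.gstatic.com/charts/loader.js"></script>
<script>
google.charts.load('current', {packages: ['sankey']});
google.charts.setOnLoadCallback(function () {
  var data = new google.visualization.DataTable();
  data.addColumn('string', 'From');
  data.addColumn('string', 'To');
  data.addColumn('number', 'Weight');
  data.addRows(__ROWS__);
  var chart = new google.visualization.Sankey(document.getElementById('sankey'));
  chart.draw(data, {width: 800, height: 400});
});
</script>
</head>
<body><div id="sankey"></div></body>
</html>
"""


def render_page(rows):
    """Embed the link rows into the HTML template and return the page text."""
    return HTML_TEMPLATE.replace("__ROWS__", json.dumps(rows))


rows = consecutive_links("geneA", {"2h": 10, "6h": 25.5, "16h": 3})
page = render_page(rows)
```

Writing the returned string to a file yields a page that can be opened directly in a browser, since all data are embedded in the page itself.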

Integration of functional information for genes and species
A user can integrate functional information for genes by adding a mapping table during the generation of the website, which makes it possible to search for domain descriptions in the search panel.
It is thereby possible to search, for example, for a Pfam or GO identifier or for any other integrated descriptive information. If a user provides abundances of taxonomic species, broader taxonomic categories can be selected within the visualizations simply by entering, for example, a phylum or species identifier. For this, the assignment to taxonomic groups has to be provided in the second column of the abundance matrix, i.e. the user must provide the taxonomic unit for each operational taxonomic unit (OTU). Instead of a gene identifier, functional information such as the overlapping gene identifier can be used. If differentially expressed genes between different time points were integrated, the tool also allows the selection of genes that are differentially expressed between all time points or between selected pairs of time points (e.g. 6h and 12h), provided that this information is supplied by the user.
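Selecting a broader taxonomic category can be sketched as a simple filter over the second column. The sketch assumes that the taxonomic assignment is stored as a semicolon-separated lineage; this format, the function name select_by_taxon and the example OTUs are hypothetical.

```python
def select_by_taxon(table, query):
    """Return identifiers whose lineage contains the queried taxon.

    Assumption: each description is a semicolon-separated lineage,
    e.g. "Bacteria;Firmicutes;Lactobacillus".
    """
    wanted = query.strip().lower()
    selected = []
    for identifier, entry in table.items():
        lineage = [t.strip().lower() for t in entry["description"].split(";")]
        if wanted in lineage:
            selected.append(identifier)
    return selected


# Hypothetical OTU table with the lineage in the description field.
otus = {
    "OTU_1": {"description": "Bacteria;Firmicutes;Lactobacillus"},
    "OTU_2": {"description": "Bacteria;Proteobacteria;Escherichia"},
    "OTU_3": {"description": "Bacteria;Firmicutes;Clostridium"},
}
firmicutes = select_by_taxon(otus, "Firmicutes")
```

Matching on whole lineage entries rather than substrings avoids accidental hits when one taxon name is contained in another.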

Visualization of elements in the HTML site
The HTML file contains three different panel views. The first panel is the input box (Supplemental Figure 1), where a user can enter genes of interest (separated by commas) or select a functional domain. If no domain information was provided, no selection box is visible. The second panel contains the genes that were selected by the search queries.
If up to ten genes match the filter criteria, all of them are shown as Sankey diagrams, whereas if more than ten genes are in the query set, genes must be selected manually from a selection box that is embedded in the website. This limit keeps the overview of the abundance changes readable. The third panel contains three visualization modes. In the mode 'DEG categories', an overview of the transitions of up- and down-regulated genes between consecutive time points is given, whereas in the mode 'GENES' all genes are shown and can be selected. In the 'METAGENOME' mode, microbial communities are visualized from the highest to the lowest taxonomic unit, a feature inspired by tools such as Krona and MEGAN, which makes it possible to analyse the abundance of each bacterium and to visualize it over time. Furthermore, selected bacteria can be chosen via the input box, which updates the 'METAGENOME' view.
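The transition overview of the 'DEG categories' mode can be illustrated by counting, for each pair of consecutive comparisons, how many genes move between the states up, down and not differentially expressed. The sketch below is an assumption about how such counts could be derived from the pairwise DEG lists; the function names and the example gene sets are hypothetical.

```python
from collections import Counter


def classify(gene, up, down):
    """Return the state of a gene in one pairwise comparison."""
    if gene in up:
        return "up"
    if gene in down:
        return "down"
    return "not DE"


def transition_counts(genes, comparisons):
    """Count state transitions between consecutive comparisons.

    comparisons: ordered list of (label, up_set, down_set) tuples.
    Returns a Counter keyed by (source node, target node).
    """
    counts = Counter()
    for (label_a, up_a, down_a), (label_b, up_b, down_b) in zip(
        comparisons, comparisons[1:]
    ):
        for gene in genes:
            src = f"{classify(gene, up_a, down_a)} ({label_a})"
            dst = f"{classify(gene, up_b, down_b)} ({label_b})"
            counts[(src, dst)] += 1
    return counts


# Hypothetical example: three genes, two consecutive comparisons.
genes = ["g1", "g2", "g3"]
comparisons = [
    ("2h vs 6h", {"g1"}, {"g2"}),
    ("6h vs 16h", {"g1", "g2"}, set()),
]
counts = transition_counts(genes, comparisons)
```

Each key of the resulting counter corresponds to one link of a Sankey diagram, with the count as the link weight.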

General use of BioSankey
BioSankey can visualize input data from two different data sources: i) gene expression data, or ii) microbial data. In general, the tool can also be used for any other dataset that contains quantitative data from different time points. Once metagenome data are integrated into BioSankey, fluctuations in the abundance of microbial species can be inferred.
These input data are then used to generate a project-specific HTML site, which relies on JavaScript to enable an intuitive and interactive selection and visualisation of all data elements, as well as filtering by functional description or by selection of genes of interest. Once the HTML site is generated, the project page, which contains the entire project with all expression information, can be exchanged with collaboration partners, as no web server or database is required. Figure 1 shows the workflow of the tool, in which two different datasets can be integrated and are then summarized with customized Python scripts to generate an HTML page. For the generated HTML page, we offer an export function that produces the Sankey diagrams in PDF format.

Comparability to other tools
We have compared BioSankey to other tools that also visualize the abundances of species or taxa, such as Krona and iTOL. An overview of the functionality that BioSankey provides in contrast to the other tools is given in Table 1. Whereas BioSankey supports both taxonomic visualisation and time-series data, Krona and iTOL do not visualize time-series data, but offer a broad range of export functionalities. In contrast to these tools, BioSankey also visualizes the expression of single genes or selected species, which they do not support so far.

Differential gene expression visualizations
To demonstrate BioSankey and to illustrate its feature for visualizing up- and down-regulated genes over time, we used published gene expression time-series data describing the effects of Camptothecin in U87-MG cell lines. Camptothecin is a drug that specifically targets topoisomerase I (Topo I). In their study, the authors reported time-related and cell line-specific changes of gene expression after Camptothecin treatment by considering expression data of six time points (2, 6, 16, 24, 48 and 72h). As a general overview of differentially expressed genes (mode: 'DEG'), BioSankey provides a feature to highlight the number of differentially expressed genes as shown in Figure  down-or not differentially expressed and can extract then the respective genes and visualize their expression to find a candidate gene or to obtain a general overview of the functionality of these genes.

Microbial community analysis
When metagenomic studies are considered, reads are often assembled with tools such as SSpace

Conclusion
We have established BioSankey as a tool that offers an alternative way to analyse gene expression or the abundances in metagenomics datasets over time by using Sankey diagrams, functional enrichment analysis and overview panels, without depending on web servers or databases. We demonstrate how the data can be viewed interactively for an efficient analysis. BioSankey helps to gain insight into the complexity of different datasets, from a high-level view of gene numbers down to individual genes and, in the case of metagenomics, from a high taxonomic level down to strain level. The tool is useful for researchers who want to analyse the taxonomic composition of bacterial species in metagenomes, for example by also selecting broader taxonomic categories. As an additional feature, data can easily be exchanged with collaborators. For dual RNA-seq experiments, BioSankey might be especially powerful when hosts and symbionts are compared to each other or when interesting candidate genes are to be detected efficiently. To this end, we have integrated various filter criteria based on domain functionalities or gene description information.

Competing interests
The authors declare no competing interests.

Material availability
The software is available at the GitHub repository https://github.com/nthomasCUBE/BioSankey.

Peyruc D, Ponting CP, Servant F, Sigrist CJ, and InterPro C. 2002. InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform 3:225-235.
Ondov BD, Bergman NH, and Phillippy AM. 2011. Interactive metagenomic visualization in