CardioVINEdb: a data warehouse approach for integration of life science data in cardiovascular diseases

Summary
One of the major challenges in bioinformatics is to integrate and manage data from different sources, including experimental microarray data, and to present them in a user-friendly format. We therefore present CardioVINEdb, a data warehouse approach developed to interact with and explore life science data. The data warehouse architecture provides a platform-independent web interface that can be used with any common web browser. A monitor component controls and updates the data from the different sources to keep them up to date. In addition, the system provides a “static” and a “dynamic” visualization component for interactive graphical exploration of the data.


Introduction
Large amounts of high-dimensional biological data are generated from different high-throughput experiments and from literature. The rapidly growing number of databases and data types poses the challenge of integrating the heterogeneous data, especially in biology. Currently there are about 1170 important molecular biology databases [1].
Thus, the challenge is to capture, model, integrate and analyze the data in a consistent way to provide a new and deeper insight into complex biological systems.
The huge quantity of information generated in the life sciences is dispersed across many databases and repositories. Diverse integration approaches for molecular biological data sources have been developed, based on different data integration techniques. Several systems, such as Atlas [2], BioWarehouse [3], Columba [4], Systomonas [5] and Reactome [6], have been developed to integrate and present heterogeneous biological data. However, most of these systems are not platform-independent and are implemented in different programming languages. They take more time to install, and the lack of logging information makes it difficult to determine whether their data is up to date. Therefore, the key task is not only to integrate and manage the data scattered across different sources by creating a data warehouse, but also to provide a simple visual representation of the integrated data that makes the underlying biological complexity easier to understand.

Design and Implementation
Based on the CardioWorkBench EU project, we implemented a platform-independent data warehouse system that integrates multiple heterogeneous data sources into a local database enriched with protein microarrays from human smooth muscle cells that are related to cardiovascular diseases. CardioVINEdb extends our VINEdb [8] information system with more data sources, microarray data and an improved data warehouse infrastructure that includes monitoring. In addition, we upgraded the visualization components and web pages for better navigation and exploration. To keep the integrated data as up to date as possible, we developed a data warehouse infrastructure that includes a monitor component. Furthermore, the common web-based user interface provides a visualization component that allows interactive exploration of the integrated data.
The CardioVINEdb system has a 4-layer architecture, which is illustrated in Figure 1. The source layer contains the data sources BRENDA, EMBL, GO, IntAct, KEGG, MINT, OMIM, PubChem, SCOP, Transfac, Transpath and UniProt. In addition to these publicly available databases, we integrate experimental microarray data of human smooth muscle cells that are associated with cardiovascular diseases. Most of these databases provide parseable flat files that can be processed by our data warehouse infrastructure. A monitor component, which is part of the integration layer, controls the different data sources. It recognizes changes in the original sources and starts a download if files have changed. In a defined cycle, the parser is activated to start the ETL (Extract-Transform-Load) process: data is extracted from the source data, transformed into the data warehouse schema and loaded into the data warehouse. The database layer, i.e. the data warehouse, allows data marts for specific analysis applications to be constructed easily. The web-based graphical user interface of CardioVINEdb is implemented with JavaServer Pages (JSP) and runs on an Apache Tomcat web server. Each data entry has detailed information and a link to the original data source. A general search engine allows the user to find information of interest spanning multiple domains, such as proteins, enzymes, genes and compounds. Additionally, each domain has its own specific search engine.
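The monitor and ETL steps described above can be illustrated with a minimal Java sketch. It is not the BioDWH implementation: the class `SourceMonitorSketch`, its methods, the tab-separated `ID<TAB>NAME` flat-file layout and the checksum-based change detection are all assumptions made for illustration, since the text does not state how changes in a source are recognized. An in-memory map stands in for a warehouse table.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and file layout are illustrative, not BioDWH's API.
class SourceMonitorSketch {
    // Last seen checksum per source name.
    private final Map<String, String> lastChecksum = new HashMap<>();

    // Hex-encoded SHA-256 digest of a source file's content.
    static String checksum(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if the source is seen for the first time or its content differs
    // from the previous check (assumed change-detection strategy).
    boolean hasChanged(String source, String content) {
        String current = checksum(content);
        String previous = lastChecksum.put(source, current);
        return !current.equals(previous);
    }

    // Extract + transform: parse "ID<TAB>NAME" lines into {id, name} records,
    // skipping blank lines and '#' comments.
    static List<String[]> extractAndTransform(String flatFile) {
        List<String[]> rows = new ArrayList<>();
        for (String line : flatFile.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] fields = trimmed.split("\t", 2);
            if (fields.length == 2) {
                rows.add(new String[] {fields[0].trim(), fields[1].trim()});
            }
        }
        return rows;
    }

    // Load: insert or overwrite records in the map standing in for a table.
    static void load(Map<String, String> warehouse, List<String[]> rows) {
        for (String[] row : rows) {
            warehouse.put(row[0], row[1]);
        }
    }

    public static void main(String[] args) {
        SourceMonitorSketch monitor = new SourceMonitorSketch();
        Map<String, String> warehouse = new HashMap<>();
        String release = "# id\tname\nP1\tKinase A\n";
        // Run the ETL step only when the monitor reports a change.
        if (monitor.hasChanged("demo-source", release)) {
            load(warehouse, extractAndTransform(release));
        }
        System.out.println(warehouse.size() + " record(s) loaded");
    }
}
```

In a real deployment the load step would write to the relational database instead of a map, and the check would run on the defined update cycle rather than once.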
To give a better understanding of the relationships between biological objects, the network-based visualization enables intuitive and comfortable exploration of the integrated data.
The graphical representations of the relationships between the entities of the data warehouse are created at runtime using JUNG and allow further interactive exploration. JUNG (http://jung.sourceforge.net/) is a Java-based library that provides classes to describe graphs, nodes and edges with additional layout preferences. A graph is generated by JUNG according to the entities selected by the user. In the static component, the system produces a PNG image file with the graphical visualization of biological objects in different domains and their linkage; this image is embedded in the HTML pages and displayed by the web browser. The dynamic component, which uses a Java applet, works in the same manner, but in this case the graph is generated and displayed directly in the applet. For more interactive navigation and exploration, the applet has a zoom function, different graph layouts and a picking function to move and select nodes within the graph. The applet is likewise embedded in the HTML pages and can be displayed by the web browser if a Java Runtime Environment is installed on the computer.
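The idea of building a graph from the entities selected by the user and assigning it a layout can be sketched in plain Java. This sketch deliberately uses only standard collections and does not reproduce JUNG's classes or the CardioVINEdb code; the class `EntityGraphSketch`, its methods, the entity identifiers and the choice of a circle layout are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an entity-relationship graph; not JUNG's API.
class EntityGraphSketch {
    // Undirected adjacency list; insertion order is kept for stable output.
    private final Map<String, Set<String>> adjacency = new LinkedHashMap<>();

    // Record an undirected relation between two warehouse entities.
    void addRelation(String a, String b) {
        adjacency.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    // Nodes to display for a selection: the selected entity followed by its
    // direct neighbours. Returns an empty list for an unknown entity.
    List<String> neighbourhood(String selected) {
        Set<String> nodes = new LinkedHashSet<>();
        if (adjacency.containsKey(selected)) {
            nodes.add(selected);
            nodes.addAll(adjacency.get(selected));
        }
        return new ArrayList<>(nodes);
    }

    // Place the nodes evenly on a circle of the given radius around (0, 0).
    static Map<String, double[]> circleLayout(List<String> nodes, double radius) {
        Map<String, double[]> positions = new LinkedHashMap<>();
        int n = nodes.size();
        for (int i = 0; i < n; i++) {
            double angle = 2.0 * Math.PI * i / n;
            positions.put(nodes.get(i),
                    new double[] {radius * Math.cos(angle), radius * Math.sin(angle)});
        }
        return positions;
    }

    public static void main(String[] args) {
        EntityGraphSketch graph = new EntityGraphSketch();
        // Made-up identifiers standing in for a protein, a gene and a compound.
        graph.addRelation("protein:P1", "gene:G1");
        graph.addRelation("protein:P1", "compound:C1");
        List<String> nodes = graph.neighbourhood("protein:P1");
        for (Map.Entry<String, double[]> e : circleLayout(nodes, 100.0).entrySet()) {
            System.out.printf("%s -> (%.1f, %.1f)%n",
                    e.getKey(), e.getValue()[0], e.getValue()[1]);
        }
    }
}
```

A rendering step would then draw the positioned nodes and their edges, either into an image for the static component or onto an interactive canvas for the dynamic one.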
The integrated data is managed in MySQL and accessed via Java Database Connectivity (JDBC). The core of the data warehouse infrastructure, named BioDWH [7], is implemented entirely in Java, which makes it independent of the operating system. It can therefore also be used separately from the web interface.
BioDWH is a bioinformatics data warehouse software kit that integrates biological information from multiple public life science data sources into a local RDBMS. It provides up-to-date integrated knowledge as well as platform and database independence. This data warehouse infrastructure is available for interested scientific users as a SourceForge project (http://sourceforge.net/projects/biodwh/).

Summary
A major challenge in the life sciences is the integration of heterogeneous data. Apart from the difficulty of studying such data within their biological context, a fundamental problem is to represent the available knowledge and make it accessible. CardioVINEdb provides integrated data from different popular life science databases, together with microarray data related to cardiovascular diseases from an EU project, in a homogeneous web-based system. The system enables intuitive search of integrated life science data, simple navigation to related information and visualization of biological domains and their relationships. Equipped with a monitor component that updates the integrated data in defined update cycles, it presents complex biological data in a user-comprehensible manner.

Figure 1: Schematic representation of the CardioVINEdb 4-layer system architecture from the original heterogeneous data sources to the web application layer.