Interpreting measured molecular data, such as results from gene expression experiments or EST sequencing, requires considering the functional or systems-biological context. This in turn requires discovering how the data relate to existing biological knowledge, which is distributed worldwide across a network of databases. In this paper we present a method for knowledge extraction from life science databases that spares scientists from screen scraping and manual web browsing.
We developed a method for extracting knowledge networks from distributed, heterogeneous life science databases. To cope with the very large data volume, the method is based on the concept of data linkage graphs (DLG). We present efficient software that enables the joining of millions of data points across hundreds of databases. To illustrate possible applications, we computed networks of protein knowledge that interconnect metabolic, disease, enzyme and gene function data.
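As a rough illustration of the idea, the following minimal sketch assumes that a data linkage graph can be modelled as an undirected graph whose nodes are database entries and whose edges are cross-references between them. The abstract does not define the DLG structure, so this model, the function names `build_linkage_graph` and `knowledge_network`, and all identifiers in the sample data are hypothetical and are not taken from the published software.

```python
from collections import defaultdict, deque


def build_linkage_graph(cross_refs):
    """Build an undirected graph from (entry_a, entry_b) cross-references."""
    graph = defaultdict(set)
    for a, b in cross_refs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def knowledge_network(graph, start):
    """Collect every entry reachable from `start` using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Visit only neighbours that have not been reached yet.
        for neighbour in graph[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen


# Hypothetical cross-references; entries are written as "database:identifier".
cross_refs = [
    ("protein:P1", "enzyme:E1"),
    ("enzyme:E1", "pathway:PW1"),
    ("protein:P1", "disease:D1"),
    ("protein:P2", "enzyme:E2"),
]

graph = build_linkage_graph(cross_refs)
network = knowledge_network(graph, "protein:P1")
```

Under these assumptions, `network` holds the entries connected to `protein:P1` across the enzyme, pathway and disease sources, while the unrelated `protein:P2` entries stay outside it. A graph traversal of this kind replaces repeated pairwise table joins, which is one plausible reason a graph-based representation scales to many databases.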
The computed networks made it possible to relate measured experimental facts to the combined biological knowledge in a holistic way. The approach was successfully applied to the high-throughput functional classification of barley EST and gene expression experiments, with the prospect of an automated pipeline for providing controlled annotation of plant gene arrays and chips.
Availability: The data linkage graphs (XML or TGF format), the integrated database schema (GML or GRAPH-ML format) and the graph computation software can be downloaded from http://pgrc.ipk-gatersleben.de/dlg/