In biology, and in functional genomics in particular, understanding the dependence and interplay between the genomic and ecological characteristics of organisms is a very challenging problem. Some public databases combine this kind of information, but much more information about microbes and other organisms still resides in unstructured and semi-structured documents, such as encyclopaedias. In this paper we present a method, based on finite state transducers, for extracting information from semi-structured resources such as encyclopaedias. The method consists of two clearly distinguished phases. The first phase relies strongly on analysis of the document structure and is used to locate records of data in the text. The second phase uses finite state transducers built for data extraction, which can be modified to achieve the preferred efficiency, to extract a particular characteristic from the text. We show how the two-phase method is applied to the text of the encyclopaedia “Systematic Bacteriology”. A fully structured database of genotype and phenotype characteristics of organisms has been created from the encyclopaedia's unstructured descriptions.
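The two phases described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the sample text, the "Genus" heading convention, the transition table and the field names (`gram_stain`, `optimum_temperature_c`) are all hypothetical and chosen only to show the idea of locating records by structure and then running a small finite state transducer over each record.

```python
import re

# Hypothetical sample text. The layout (one heading line per organism followed
# by a free-text description) is an assumption, not the encyclopaedia's format.
SAMPLE = """\
Genus Alpha
Cells are rod-shaped. Gram negative. Optimum temperature 37 C.
Genus Beta
Cells are cocci. Gram positive. Optimum temperature 30 C.
"""


def locate_records(text):
    """Phase 1: use document structure (heading lines) to split text into records."""
    records = []
    for line in text.splitlines():
        if line.startswith("Genus "):
            records.append({"name": line[len("Genus "):].strip(), "body": ""})
        elif records:
            records[-1]["body"] += line + " "
    return records


# Phase 2: a small finite state transducer over word tokens.
# Maps (state, input symbol) -> (next state, output field or None).
# "NUM" is a symbol class standing for any integer token.
TRANSITIONS = {
    ("start", "gram"): ("gram", None),
    ("gram", "negative"): ("start", "gram_stain"),
    ("gram", "positive"): ("start", "gram_stain"),
    ("start", "optimum"): ("opt", None),
    ("opt", "temperature"): ("temp", None),
    ("temp", "NUM"): ("start", "optimum_temperature_c"),
}


def transduce(body):
    """Run the transducer over one record and collect the emitted fields."""
    state, fields = "start", {}
    for token in re.findall(r"[a-z]+|\d+", body.lower()):
        symbol = "NUM" if token.isdigit() else token
        step = TRANSITIONS.get((state, symbol))
        if step is None:
            # No transition from the current state: retry the token from the
            # start state, or stay in the start state and skip the token.
            step = TRANSITIONS.get(("start", symbol), ("start", None))
        state, field = step
        if field is not None:
            fields[field] = token  # emit the matched token as the field value
    return fields


def build_database(text):
    """Combine both phases into a list of structured rows."""
    return [{"name": r["name"], **transduce(r["body"])} for r in locate_records(text)]


if __name__ == "__main__":
    for row in build_database(SAMPLE):
        print(row)
```

Keeping the transition table separate from the driver loop means the extraction rules can be changed without touching the record-location code, which mirrors the separation between the two phases.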
© 2011 The Author(s). Published by Journal of Integrative Bioinformatics.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.