In silico strain optimization by adding reactions to metabolic models

Summary Nowadays, the concerns about the environment and the needs to increase the productivity at low costs, demand for the search of new ways to produce compounds with industrial interest. Based on the increasing knowledge of biological processes, through genome sequencing projects, and high-throughput experimental techniques as well as the available computational tools, the use of microorganisms has been considered as an approach to produce desirable compounds. However, this usually requires to manipulate these organisms by genetic engineering and/ or changing the enviromental conditions to make the production of these compounds possible. In many cases, it is necessary to enrich the genetic material of those microbes with hereologous pathways from other species and consequently adding the potential to produce novel compounds. This paper introduces a new plug-in for the OptFlux Metabolic Engineering platform, aimed at finding suitable sets of reactions to add to the genomes of selected microbes (wild type strain), as well as finding complementary sets of deletions, so that the mutant becomes able to overproduce compounds with industrial interest, while preserving their viability. The necessity of adding reactions to the metabolic model arises from existing gaps in the original model or motivated by the productions of new compounds by the organism. The optimization methods used are metaheuristics such as Evolutionary Algorithms and Simulated Annealing. The usefulness of this plug-in is demonstrated by a case study, regarding the production of vanillin by the bacterium E. coli.


Introduction
In a world where the markets pressure, globalization and the shortage of some natural resources have a large impact on the world economy, it is urgent to find alternative processes to produce compounds of interest, as well as to increase the productivity of existing processes.Biotechnological processes, involving the use of selected microorganisms to produce substances with industrial interest, are becoming an alternative to tradicional processes.To make these processes competivite, it becomes crucial to have a deep knowledge of the biological processes ocurring in these microbes, a task made possible by the recent genome sequencing projects and other experimental techniques, together with the boost of the Bioinformatics field to handle these new data.One interesting approach involves making use of computational tools to simulate the microbe's behavior in silico, before the application of genetic or enviromental manipulations in vivo [1].
An important challenge in Metabolic Engineering (ME) [2] consists in the identification of genetic manipulations to be applied to an organism, with the aim of constructing a mutant strain able to produce compounds of industrial interest.Based on the knowledge about the biological system and, more specifically, its metabolic network, we can manipulate the environment in which it develops, or alter it genetically, to maximize the production of a given compound [3].
Recently, advances have been achieved concerning the available knowledge of some biological organisms, for instance from the sequencing of their genomes and also from various types of high-throughput experimental data (e.g.gene expression, proteomics).However, the lack of tools to perform the analysis and interpretation of biological data still limits the use and interconnection of that knowledge [4].
In this context, OptFlux (http://www.optflux.org)[5] is an open-source and modular platform for ME, incorporating strain optimization tasks, using Evolutionary Algorithms (EAs) [6] and Simulated Annealing (SA) [7].OptFlux also allows the use of stoichiometric metabolic models for phenotype simulation of both wild-type and mutant organisms, Metabolic Flux Analysis and pathway analysis using Elementary Flux Modes, among other features.
When performing strain optimization, some limitations arise from the fact that metabolic models are incomplete [8] or the desired product can not be produced by the organism.In both cases, it will be necessary to find reactions to add to the metabolic model.In this paper, we present a new plug-in for OptFlux that allows to incorporate a set of reactions from an external database into an existing metabolic model performing phenotype simulation using those added reactions.Also, optimization methods will be put forward to allow the selection of the best set of reactions to add to the model, according to a given objective function (e.g.maximizing the production of a compound or filling gaps in the model).
The major concern, during the optimization task, is the large amount of reactions in the external database.If the number of reactions is high, the number of possible combinations increases exponentially and therefore it is crucial to reduce the number of reactions before starting the optimization process.So, methods to reduce and filter the database given a desired product are also developed in this work.

Methods for phenotype simulation and strain optimization
The simulation process allows the prediction of the organism phenotype, using methods based on fundamental restrictions to the biological system.One of these methods is Flux Balance Analysis (FBA) [9,10], that calculates the flux distribution making it possible to predict the growth rate of an organism or the rate of production of a metabolite, based on stoichiometric, reversibility and fluxes constraints.FBA assumes that metabolic networks will reach a steady state constrained by the stoichiometry.
Predicting the metabolic state of an organism after a genetic manipulation (e.g.gene knockout) is a challenging task, because mutants are generally not subjected to the same evolutionary pressure that shaped the wild type.In these cases, other methods such Minimization of Metabolic Adjustment (MOMA) [11] and Regulatory On/Off Minimization of metabolic fluxes (ROOM) [12] are proposed to find a flux distribution for mutant strains.
Based on these methods, a question arises: how to find the ideal set of genes to be deleted to reach the desired phenotype?To try answer this question, the OptGene algorithm proposed by the authors' research group proposed a set-based representation that considered variable-sized solutions, allowing for solutions with different numbers of knockouts during the optimization process.Two optimization algorithms were developed: SA and Set-based EAs (SEAs).Both search for the optimum set size in parallel with the search for the optimum set of gene deletions.This work aims to enlarge the set of possible genetic modifications by addressing gene additions.In this case, using SEAs or SA approaches, the optimization process finds a set of new reactions to be added to the model.Optionally, a complementary set of reactions to remove can also be optimized.Optimization methods are the same that were used previously.The main difference lies in the representation of the solutions.Although still using a representation based on sets, it is necessary to integrate information regarding the reactions to be added.Thus, a new way of representing solutions including two independent sets (knockouts and added reactions) was created.In Figure 1, the representation of one solution is depicted.For the new solution representation used in SA and EAs algorithms, it is necessary to develop new reproduction operators: crossover and mutation.These operators only modify one of the sets (knockouts or added reactions) in the genome, randomly selected.The new used operators are: • Random Mutation: replaces an element of the set by another; • Grow Mutation: introduces a number of new elements into the set, whose values are randomly generated; • Shrink Mutation: removes a number of randomly selected elements from the set; • Crossover: the genes that are present in both parent sets are kept in both offspring; the genes that are present in only one of the parents are sent to one of the offspring, selected randomly with equal probabilities.

OptFlux plug-in for adding reactions
A new plug-in was developed for OptFlux to allow the addition of external reactions to a metabolic model.The addition of new reactions can be made for phenotype simulation or to conduct a strain optimization process.Methods to import, filter and visualize the external database of reactions are also available.In the OptFlux platform the new functionalities can be accessed by the "Plugins/ Add Reactions" menu.

Import database of reactions
Importing an external database of reactions into OptFlux can be made using the same methods used for creating metabolic models (SBML [15] and flat text files).Also, a new format of text files is defined (details are in the site documentation) to allow a more flexible scheme.When using this format, the user can filter the input data files to select only reactions that satisfy some restrictions.This is useful for readability and to reduce the search space in the optimization tasks.In Figure 2, the application of two filters to a database is shown.After applying filters, the user obtains a set of reactions that will be imported to the OptFlux platform.The reactions database size can be reduced by applying methods to keep only the reactions that may lead to the production of the target compound.

Filter database by target compound
As stated before, the major problem in finding the reactions set to add to the metabolic model is related with the search space size, i.e. the number of reactions contained in the external database.To overcome this obstacle, methods were developed to filter the reactions database.Given the target product, the algorithm selects, from the database, only the reactions present in pathways between metabolites present in the metabolic model and the target product.The user can manipulate the number of selected reactions through a tolerance parameter.
The main steps of the algorithm are the following: • construct a structure that represents the dependencies between the target product and metabolites belonging to the metabolic model.All metabolites that belong to the metabolic model have the cost = 1, and the the reactions of external database and intermediate metabolites have an empty cost, before start runing the algorithm.
• calculate the cost to obtain each intermediate metabolite.The cost is the lower value from the reactions costs that produce it.The cost of a reaction is given by the sum of its reagents costs, when all reagents have an associated cost.In each algorithm iteration a set of new metabolites and reactions costs are calculated.When during an iteration is impossible to calculate any missing cost, because is no way to obtain a specific metabolite, the algorithm stops.
• insert into the database the reactions that belong to pathways with lower cost.Through the tolerance parameter, the higher accepted cost is given by the rule min c ost +(max cost − min cost) * tolerance/100.In Figure 3 we can observe one example of a set of selected reactions, if the cost for obtaining the MetaboliteX varies from 2 to 10, and the user chooses 30% for tolerance value, the reactions that will be picked have costs between 2 and 4.4.
Figure 3: This scheme represents the reactions that will be chosen for the metabolite X production.
In this example we can observe that the metabolite x can be produced through the reactions: Reac1, Reac2 and Reac3, with costs 2, 4 and 10 respectively.If the user chooses 30% for tolerance value, only the reactions Reac1 and Reac2 will be inserted in the database.This process is repeated for all metabolites from reactions Reac1 and Reac2.Note that the Met2 is already in the original metabolic model.This filter, based on the desired product, can be very useful to create a new database with reactions that belong to new pathways for producing the target compound.However, this filter may discard reactions that when added to the metabolic model could increase the yield of target compound or biomass production.

Mutant simulation by adding reactions
The phenotype simulation functionality allows mutant simulation by adding new reactions and optionally removing others from the model.After selecting the model to use, a previously loaded database is selected and the set of reactions to be added is chosen.Also, a set of knockouts can be selected.In Figure 4, the simulation interface is presented.During the configuration process, the user selects the simulation methods (FBA, MOMA or ROOM), the environmental conditions (the rates at which external metabolites can be consumed/ produced), and the objective function (e.g. the maximization/ minimization of a selected flux).
The result of mutant simulation can be observed in a specific interface (Figure 5), where the user can check the main results of the simulation: the list of added reactions, list of knockouts and values for all fluxes in the model.

Strain optimization by adding reactions
The strain optimization process tries to find a set of reactions from a database to be added to the model to improve a given objective function (e.g. the production of specific product).The search can be for only a set of reactions to be added or the combination of added reactions and knockouts.In the interface (Figure 6), the user selects: • algorithm: available optimization algorithms are EAs and SA; • simulation methods: to be used in the simulation of each solution evaluated (FBA, MOMA or ROOM); • objective function: used to calculate the fitness value of each solution; options are the Biomass-Product Coupled Yield (BPCY) [13] and Product Yield.The BPCY function is given by BPCY = (PG)/S , where P stands for the flux representing the excreted product; G for the organism's growth rate (biomass flux) and S for the substrate intake flux; • optimization basic setup: configure the maximum number of solution evaluations, the maximum number of knockouts and added reactions and if the genome size should be fixed or have a variable size; • environmental conditions: as defined for the simulation; • essential information: define if it is possible to knockout some special type of reactions like drains, transport and critical reactions.Furthermore, there are available interfaces to view the solutions found during the optimization task.For each, the list of knockouts, the list of added reactions and the flux values for all reactions of the metabolic model are displayed.

Rebuilding gaps in the metabolic model
In this case study, used for validation purposes, OptFlux simplification methods were used to identify reactions constrained to a flux value of zero in the E.coli model.The model is reduced eliminating those reactions and a database is created with the removed reactions (407 reactions).
In each run, three randomly selected reactions are further removed from the new reduced model and inserted into the database.The optimization methods must find these reactions and reintegrate them in the model to maximize biomass production.This process was repeated 10 times for SA and EA.The number of evaluations needed to find the optimal solution in each run are given in Table 1.

Vanillin case study
This case study aims to identify new pathways for the production of vanillin from glucose in E. coli and validate the implemented simulation method.To demonstrate the validity of the simulation process, we used the previous study with the OptStrain framework [16].To proceed with the test it was required to build the reactions database to add to the metabolic model.The new pathway added to the metabolic model can be observed in Figure 7.The simulation was performed for each of three sets of knockouts in the paper, considering the substrate flux of 10 mmol/gDW h −1 and the objective function the maximization of biomass production.The FBA was used as the simulation method in phenotype simulation.The obtained results agree with the ones from the previous work [16], thus validating our implementation.
The next step was to run the strain optimization processes to find a set of added reactions and knockouts, that maximizes vanillin production coupled with the organism growth.
The process was run 30 times for each EA and SA algorithms using as objective function the Biomass-Product Coupled Yield (BPCY).This function assigns the same importance for the biomass and target product production, multiplying their values.Before starting the optimization process, it was necessary to change the ids of the metabolites from the metabolic model to those used in the database.Otherwise, it would not be possible to integrate new reactions in the metabolic model.
The process parameters were configured taking EAs and SA encoding sets with a variable size.An important parameter is concerned with the maximum size allowed for both sets, containing respectrively the list of knockouts and the list of added reactions.Thus, four distinct configurations were tested with different combinations for the maximum number of knockouts and of added reactions as follows: config1 ⇐ number of knockouts: 20 and number of added reactions 10; config2 ⇐ number of knockouts: 10 and number of added reactions 10; config3 ⇐ number of knockouts: 7 and number of added reactions 7; config4 ⇐ number of knockouts: 5 and number of added reactions 5.
Figure 8 shows the distribution of obtained values for the biomass and vanilin production using each configuration for the EA and SA algorithms.
Table 2 shows the 95% confidence interval of results obtained in the optimization process, considering the best solution from each run.Comparing these results with the ones obtained in the previous study [16], we see that the BPCY value of their solution was 0.035 (BPCY = (6.787× 0.052)/10 = 0.035).Although the vanillin production is lower in our case, the BPCY value increased significantly given that the biomass is much higher, which mean that our strain has a larger growth rate.
Afterwards, we focused in increasing the production of vanillin, without treating the biomass formation as a priority.Considering this, the tests were repeated with a new objective function, by maximizing the flux of the product, ensuring a minimum limit of biomass production (5% of the wild type value).The results shown in Table 3 contain the best solution for each configuration, where the production of vanillin is higher than the obtained in [16].The smaller set of added reactions suggested by the optimization process include the reactions with KEGG [17] (http://www.genome.jp/kegg)ids: • R01216 -2-dehydropantoate formaldehyde-lyase; • R01627 -3-dehydroshikimate hydro-lyase; • R05273 -oxygen oxidoreductase; • R05274 -vanillate:oxygen oxidoreductase.
A supplementary file containing the full results obtained in the experiments summarized here is given in http://darwin.di.uminho.pt/jib/.

Conclusion
This paper presents methods for the simulation of strains by adding external reactions to the metabolic model, aiming to produce a target product with industrial interest or to fill gaps in metabolic model.In this approach, information is added to the stoichiometric model regarding new reactions, thus making an extension to the initial model.
Users can create their own database using files containing information on reactions and metabolites.After importing the database to the OptFlux platform it is possible to create new databases with a subset of reactions by applying some filters.Thus, the search space of possible solutions can be reduced this way.
Methods for strain optimization were developed, using EAs and SA, to find a sets of external reactions to be added and the necessary knockouts to maximize an objective function, typically related to the production of a compound of interest.
To provide these features to the scientific community, a plug-in has been developed for the Opt-Flux ME platform that allows simple and intuitive phenotype simulation and strain optimization with the addition of external reactions to the metabolic model.Thus, the tool set available for ME experts has been enlarged with useful techniques.
Future work will be devoted to the validation of these methods with other real world case studies.

Figure 1 :
Figure 1: Representation of the genome of an individual.Green squares represent reactions that will be added to the model (numbers are the reactions indexes in the external database).The knockouts are represented by red squares (numbers are indexes of reactions in the model).

Figure 2 :
Figure 2: Interface for selecting reactions and importing them to OptFlux.In this example, the user chooses only the reactions where ids start with "R" and that are reversible.

Figure 4 :
Figure 4: Interface for mutant simulation.The case study of vanillin production is shown here (see below).In the example, 4 knockouts and 4 added reactions are selected.

Figure 5 :
Figure 5: Interface showing simulation results: in the left the clipboard shows the main objects and in the right side the visualization of the main results of mutant simulation are shown in distinct tabs.

Figure 6 :
Figure6: Interface for strain optimization processes.In this example, an EA is configured, the simulation method is FBA, the objective function is BPCY, essential information uses the critical reactions, a maximum of 15 knockouts and 4 new reactions are permitted in variable sized sets.

Figure 8 :
Figure 8: This figure shows the results obtained by the EA and SA algorithms in the 30 runs for each of the configurations showing boxplots for the: a) the values of biomass fluxes; b) the values for the desired product flux (vanillin production) and c) the fitness values (BPCY).

Table 1 :
Number of function evaluations to find the optimal solution using SA and EA algorithms.

Table 2 :
The 95% confidence interval of results obtained in the optimization process.

Table 3 :
Best results of strain optimization for vanillin production using the Yield objective function for each algorithm (EA and SA).