Scientific legacy workflows are often developed over many years, poorly documented and implemented with scripting languages. In the context of our cross-disciplinary projects we face the problem of maintaining such scientific workflows. This paper presents the Workflow Instrumentation for Structure Extraction (WISE) method used to process several ad-hoc legacy workflows written in Python and automatically produce their workflow structural skeleton. Unlike many existing methods, WISE does not assume input workflows to be preprocessed in a known workflow formalism. It is also able to identify and analyze calls to external tools. We present the method and report its results on several scientific workflows.
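To make the idea of a "workflow structural skeleton" concrete, the following minimal sketch lists the calls made by a Python script and flags those that launch external tools. It is not the WISE implementation: WISE is described as an instrumentation method, whereas this sketch swaps in a purely static pass over the syntax tree with the standard `ast` module. The names `EXTERNAL_CALLS`, `dotted_name`, and `extract_skeleton`, as well as the example script, are hypothetical and introduced only for illustration.

```python
import ast

# Assumption: calls with these dotted names are treated as external tool launches.
EXTERNAL_CALLS = {
    "subprocess.run", "subprocess.call", "subprocess.check_call",
    "subprocess.check_output", "subprocess.Popen", "os.system",
}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def extract_skeleton(source):
    """Return (line, kind, name) for each named call, sorted by line number."""
    steps = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is None:
                continue  # e.g. a call on a subscript or on another call's result
            kind = "external" if name in EXTERNAL_CALLS else "call"
            steps.append((node.lineno, kind, name))
    return sorted(steps)


# Hypothetical legacy script; the tool and file names are made up.
example = '''
import subprocess
data = load("reads.fq")
subprocess.run(["bwa", "mem", "ref.fa", "reads.fq"])
plot(data)
'''
skeleton = extract_skeleton(example)
for line, kind, name in skeleton:
    print(line, kind, name)
```

A static pass like this cannot see calls that are built dynamically (for example through `getattr` or `eval`), which is one reason a run-time instrumentation approach may be preferable for ad-hoc legacy scripts.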
The relation between the distribution of hydrophobic amino acids along protein chains and the structure of those chains is far from being completely understood. No reliable method allows ab initio prediction of the folded structure from this distribution of physicochemical properties, even when the properties are reduced to only two classes: hydrophobic and polar. Establishment of long-range hydrophobic three-dimensional (3D) contacts is essential for the formation of the nucleus, a key process in the early steps of protein folding. A large number of 3D simulation studies have therefore been developed to address this issue. They are nowadays evaluated in a specific category of the molecular modeling competition, the Critical Assessment of Protein Structure Prediction (CASP). We present here a simulation of the early steps of the folding process for 850 proteins, performed in a discrete 3D space, which results in peaks in the predicted distribution of intra-chain noncovalent contacts. The residues located at these peak positions tend to be buried in the core of the protein and are expected to correspond to critical positions in the sequence, important both for folding and for structural (or similarly, energetic in the thermodynamic hypothesis) stability. The degree of stabilization or destabilization due to a point mutation at the critical positions involved in numerous contacts is estimated from the calculated folding free energy difference between mutated and native structures. The results show that these critical positions do not tolerate mutation. This simulation of the noncovalent contacts needs only a sequence as input, and this paper proposes a validation of the method by comparison with stability predictions from well-established programs.
Endosymbiotic theory suggests that plastids originated from a photosynthetic bacterium that was engulfed by a primitive eukaryotic cell. Consequently, the chloroplast genome remains affected by this ancestral event, although it is reduced in size and in the number of constituent genes. Most of the plastid genome has been transferred to the host cell nuclear genome and is now nuclear-encoded. Thus, chloroplast proteins are synthesized in the cytosol as precursors with N-terminal extensions called transit peptides. The evolution of import machinery was required to transfer transit peptides to the stroma. To date, two protein complexes have been found to mediate the import process: the Toc (outer) and Tic (inner) envelope membrane translocons. The evolutionary origin of many Tic and Toc proteins has been established, but not that of the Tic110 subunit. Tic110 binds signal peptides and serves as a scaffold for the recruitment of stromal components. In this study, we analyzed hydrophobic clusters, protein folds, and protein structure homology, and we conclude that Tic110 is composed of fourteen repeated motifs related to HEAT-repeats. We discuss a possible explanation for the presence of such repeats in Tic110, their arrangement into separate domains, and their probable function in the chloroplast import process.
The role of exons can be studied on many levels, one of which pertains to protein structure. It is well known that secondary structural motifs do not directly correspond to exons: helices, β-sheets and loops have all been identified as encoded by more than one exon. The relation between exon fragments and their involvement in shaping the three-dimensional (3D) structure of a protein body is the subject of ongoing studies. In particular, the role of exons in stabilizing tertiary structures can be related to the structure of the hydrophobic core of the protein. Participation of specific polypeptide fragments (single exons) in hydrophobic stabilization reveals the role played by each fragment. In the course of the presented research, exons in selected proteins were identified on the basis of GenBank files imported from the nucleotide database at the National Center for Biotechnology Information. Amino acid sequences representing each exon were subsequently traced to parts of 3D structural forms. The participation of each exon fragment in shaping the hydrophobic core of the protein was measured using divergence entropy calculations. It was found that each protein contains at least one exon which encodes a structural fragment in accordance with the theoretical hydrophobic core model. This implies that the likely role of at least one exon in each protein is to generate a hydrophobic core which is, in turn, responsible for tertiary structural stabilization.
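As an illustration of a divergence entropy calculation, the sketch below computes the Kullback-Leibler divergence between an observed per-residue hydrophobicity profile of one exon-encoded fragment and a theoretical profile. The abstract does not state the exact criterion used, so the comparison against a uniform reference, the function names, and all numeric values are assumptions made up for this example.

```python
import math


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def kl_divergence(observed, reference):
    """D_KL(observed || reference) in bits.

    Both inputs are normalized first; the reference is assumed to be
    strictly positive wherever the observed value is positive.
    """
    p = normalize(observed)
    q = normalize(reference)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up per-residue hydrophobicity values for one exon fragment.
observed = [0.9, 0.7, 0.2, 0.1, 0.1]
# Made-up values standing in for the theoretical hydrophobic core model.
theoretical = [0.8, 0.6, 0.3, 0.2, 0.1]
# Assumed baseline: no hydrophobic core at all (uniform distribution).
uniform = [1.0] * len(observed)

d_theoretical = kl_divergence(observed, theoretical)
d_uniform = kl_divergence(observed, uniform)

# Assumed criterion: the fragment is "in accordance" with the core model
# when its profile is closer to the theoretical one than to the uniform one.
accordant = d_theoretical < d_uniform
print(d_theoretical, d_uniform, accordant)
```

Under this assumed criterion, a fragment whose observed profile diverges less from the theoretical profile than from the uniform one would be counted as contributing to the hydrophobic core.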
The transition state ensemble during the folding process of globular proteins occurs when a sufficient number of intrachain contacts are formed, mainly, but not exclusively, due to hydrophobic interactions. These contacts are related to the folding nucleus, and they contribute to the stability of the native structure, although they may disappear after the energetic barrier of transition states has been passed. A number of structure and sequence analyses, as well as protein engineering studies, have shown that the signature of the folding nucleus is surprisingly present in the native three-dimensional structure, in the form of closed loops, and also in the early folding events. These findings support the idea that the residues of the folding nucleus become buried in the very first folding events, thereby helping the formation of closed loops that act as anchor structures, speed up the process, and overcome the Levinthal paradox. We present here a review of an algorithm intended to simulate, in a discrete space, the early steps of the folding process. It is based on a Monte Carlo simulation in which perturbations, or moves, are randomly applied to residues within a sequence. In contrast with many technically similar approaches, this model does not intend to fold the protein but to calculate the number of non-covalent neighbors of each residue during the early steps of the folding process. Amino acids along the sequence are categorized as most interacting residues (MIRs) or least interacting residues. The MIR method can be applied under a variety of circumstances. In the cases tested thus far, MIR has successfully identified the exact residue whose mutation causes a switch in conformation. This is consistent with the idea that MIR identifies residues that are important in the folding process. Most MIR positions correspond to hydrophobic residues; correspondingly, MIRs have zero or very low accessible surface area.
Alongside the review of the MIR method, we present a new postprocessing method called smoothed MIR (SMIR), which refines the original MIR method by exploiting knowledge of residue hydrophobicity. We review known results and present new ones, focusing on the ability of MIR to predict structural changes and secondary structure, and on the improved precision obtained with the SMIR method.
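The general idea of counting non-covalent neighbors during a Monte Carlo simulation in a discrete space can be sketched as follows. This is a toy model under stated assumptions, not the published MIR algorithm: the simple cubic lattice, the end and corner-flip moves, the two-letter hydrophobic/polar (HP) energy, the Metropolis acceptance rule, and every name and parameter value below are choices made for this example only.

```python
import math
import random

# Assumption: a simple cubic lattice with six unit steps.
DIRECTIONS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def neighbor_counts(chain):
    """Count lattice neighbors of each residue that are not adjacent in sequence."""
    index = {pos: i for i, pos in enumerate(chain)}
    counts = [0] * len(chain)
    for i, pos in enumerate(chain):
        for d in DIRECTIONS:
            j = index.get(add(pos, d))
            if j is not None and abs(i - j) > 1:
                counts[i] += 1
    return counts


def energy(chain, hydrophobic):
    """Toy HP energy: -1 per non-covalent contact between two hydrophobic residues."""
    index = {pos: i for i, pos in enumerate(chain)}
    total = 0
    for i, pos in enumerate(chain):
        if not hydrophobic[i]:
            continue
        for d in DIRECTIONS:
            j = index.get(add(pos, d))
            if j is not None and j > i + 1 and hydrophobic[j]:
                total -= 1  # j > i + 1 ensures each pair is counted once
    return total


def propose(chain, rng):
    """Move one random residue (end move or corner flip); None if the move is invalid."""
    n = len(chain)
    i = rng.randrange(n)
    if i == 0:
        new = add(chain[1], rng.choice(DIRECTIONS))
    elif i == n - 1:
        new = add(chain[n - 2], rng.choice(DIRECTIONS))
    else:
        a, b, c = chain[i - 1], chain[i], chain[i + 1]
        new = (a[0] + c[0] - b[0], a[1] + c[1] - b[1], a[2] + c[2] - b[2])
    if new in chain:  # occupied site; also rejects straight segments, where new == b
        return None
    moved = list(chain)
    moved[i] = new
    return moved


def simulate(sequence, steps=2000, temperature=1.0, seed=0):
    """Return the mean number of non-covalent neighbors per residue and the final chain."""
    rng = random.Random(seed)
    hydrophobic = [residue == "H" for residue in sequence]
    chain = [(i, 0, 0) for i in range(len(sequence))]  # start fully extended
    current = energy(chain, hydrophobic)
    totals = [0] * len(sequence)
    for _ in range(steps):
        moved = propose(chain, rng)
        if moved is not None:
            candidate = energy(moved, hydrophobic)
            # Metropolis rule: always accept downhill, sometimes accept uphill.
            if candidate <= current or rng.random() < math.exp((current - candidate) / temperature):
                chain, current = moved, candidate
        for i, count in enumerate(neighbor_counts(chain)):
            totals[i] += count
    return [t / steps for t in totals], chain


sequence = "HPHPPHHPHPPHPHHPPHPH"  # made-up HP string
mean_contacts, final_chain = simulate(sequence)
# Residues with the highest mean contact counts play the role of MIR candidates here.
ranked = sorted(range(len(sequence)), key=lambda i: mean_contacts[i], reverse=True)
print(ranked[:5])
```

In this sketch the quantity of interest is the per-residue contact profile accumulated along the trajectory, not the final conformation, which mirrors the stated goal of counting neighbors rather than folding the protein.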