Scientific legacy workflows are often developed over many years, poorly documented, and implemented in scripting languages. In our cross-disciplinary projects, we face the problem of maintaining such workflows. This paper presents the Workflow Instrumentation for Structure Extraction (WISE) method, which we use to process several ad-hoc legacy workflows written in Python and automatically produce their structural skeletons. Unlike many existing methods, WISE does not assume that input workflows have been preprocessed into a known workflow formalism. It is also able to identify and analyze calls to external tools. We present the method and report its results on several scientific workflows.
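To make the idea of identifying calls to external tools concrete, the following is a minimal sketch that scans Python source for calls that launch external programs. It is not the WISE method: it uses static inspection with the standard `ast` module rather than instrumentation, and the `EXTERNAL_CALLS` set and the function names are hypothetical choices made for illustration.

```python
import ast

# Hypothetical set of call names treated as external-tool launchers.
EXTERNAL_CALLS = {
    "os.system", "os.popen",
    "subprocess.run", "subprocess.call", "subprocess.check_call",
    "subprocess.check_output", "subprocess.Popen",
}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_external_calls(source):
    """List (line number, call name) pairs for external-tool calls in source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in EXTERNAL_CALLS:
                found.append((node.lineno, name))
    return sorted(found)
```

A static scan of this kind misses aliased imports (for example `from subprocess import run`) and dynamically built calls, which is one reason a runtime instrumentation approach can be preferable for ad-hoc scripts.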
The transition state ensemble during the folding process of globular proteins occurs when a sufficient number of intrachain contacts are formed, mainly, but not exclusively, due to hydrophobic interactions. These contacts are related to the folding nucleus, and they contribute to the stability of the native structure, although they may disappear after the energy barrier of the transition states has been passed. A number of structure and sequence analyses, as well as protein engineering studies, have shown that the signature of the folding nucleus is, surprisingly, present in the native three-dimensional structure, in the form of closed loops, and also in the early folding events. These findings support the idea that the residues of the folding nucleus become buried in the very first folding events, thereby helping the formation of closed loops that act as anchor structures, speed up the process, and overcome the Levinthal paradox. We present here a review of an algorithm intended to simulate the early steps of the folding process in a discrete space. It is based on a Monte Carlo simulation in which perturbations, or moves, are randomly applied to residues within a sequence. In contrast with many technically similar approaches, this model does not intend to fold the protein but to calculate the number of non-covalent neighbors of each residue during the early steps of the folding process. Amino acids along the sequence are categorized as most interacting residues (MIRs) or least interacting residues. The MIR method can be applied under a variety of circumstances. In the cases tested thus far, MIR has successfully identified the exact residue whose mutation causes a switch in conformation. This is consistent with the idea that MIR identifies residues that are important in the folding process. Most MIR positions correspond to hydrophobic residues; accordingly, MIRs have zero or very low accessible surface area.
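The counting step described above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the MIR algorithm itself: a simple cubic lattice, single-residue end and corner moves only, a toy two-class (hydrophobic/polar) contact energy with Metropolis acceptance, and a user-supplied `threshold` for the final labeling. All function and parameter names are hypothetical.

```python
import math
import random

# Six unit steps of the simple cubic lattice (an assumed lattice choice).
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def adjacent(p, q):
    """True if two lattice sites are one unit step apart."""
    return sum(abs(a - b) for a, b in zip(p, q)) == 1


def contacts(chain, i):
    """Indices of residues touching residue i without being bonded to it."""
    return [j for j, q in enumerate(chain)
            if abs(i - j) > 1 and adjacent(chain[i], q)]


def energy(chain, hydrophobic):
    """Toy energy: -1 per hydrophobic-hydrophobic non-covalent contact."""
    e = 0
    for i in range(len(chain)):
        if hydrophobic[i]:
            e -= sum(1 for j in contacts(chain, i) if j > i and hydrophobic[j])
    return e


def mean_neighbors(hydrophobic, n_steps=2000, temperature=1.0, seed=0):
    """Average number of non-covalent neighbors per residue (needs >= 2 residues)."""
    rng = random.Random(seed)
    n = len(hydrophobic)
    chain = [(i, 0, 0) for i in range(n)]  # start from a fully extended chain
    totals = [0] * n
    e = energy(chain, hydrophobic)
    for _ in range(n_steps):
        i = rng.randrange(n)
        anchor = chain[i - 1] if i > 0 else chain[i + 1]
        new = tuple(a + s for a, s in zip(anchor, rng.choice(STEPS)))
        # The moved residue must stay bonded to its chain neighbors
        # and must not land on an occupied site.
        bonded = all(adjacent(new, chain[k]) for k in (i - 1, i + 1) if 0 <= k < n)
        if bonded and new not in chain:
            trial = chain[:i] + [new] + chain[i + 1:]
            e_new = energy(trial, hydrophobic)
            # Metropolis criterion: always accept downhill, sometimes uphill.
            if e_new <= e or rng.random() < math.exp((e - e_new) / temperature):
                chain, e = trial, e_new
        for k in range(n):
            totals[k] += len(contacts(chain, k))
    return [t / n_steps for t in totals]


def classify(means, threshold):
    """Label residues at or above the (hypothetical) threshold as MIR, others as LIR."""
    return ["MIR" if m >= threshold else "LIR" for m in means]
```

As in the method described above, the goal of the sketch is not to reach a folded structure: the quantity of interest is the per-residue neighbor count accumulated over the run. The restricted move set used here is chosen for brevity and explores conformations far less thoroughly than a production simulation would.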
Alongside the review of the MIR method, we present a new postprocessing method called smoothed MIR (SMIR), which refines the original MIR method by exploiting knowledge of residue hydrophobicity. We review known results and present new ones, focusing on the ability of MIR to predict structural changes and secondary structure, and on the improved precision obtained with the SMIR method.
The relation between the distribution of hydrophobic amino acids along protein chains and their structure is far from being completely understood. No reliable method allows ab initio prediction of the folded structure from this distribution of physicochemical properties, even when the properties are reduced to only two classes: hydrophobic and polar. The establishment of long-range hydrophobic three-dimensional (3D) contacts is essential for the formation of the nucleus, a key process in the early steps of protein folding. Thus, a large number of 3D simulation studies have been developed to address this issue. They are nowadays evaluated in a specific category of the molecular modeling competition, Critical Assessment of Protein Structure Prediction (CASP). We present here a simulation of the early steps of the folding process for 850 proteins, performed in a discrete 3D space, which results in peaks in the predicted distribution of intra-chain noncovalent contacts. The residues located at these peak positions tend to be buried in the core of the protein and are expected to correspond to critical positions in the sequence, important both for folding and for structural (or, equivalently under the thermodynamic hypothesis, energetic) stability. The degree of stabilization or destabilization due to a point mutation at the critical positions involved in numerous contacts is estimated from the calculated folding free energy difference between mutated and native structures. The results show that these critical positions are not tolerant of mutation. This simulation of the noncovalent contacts needs only a sequence as input, and this paper proposes a validation of the method by comparison with stability predictions from well-established programs.
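The two quantities discussed above, peaks in a per-residue contact profile and the free energy difference upon mutation, can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper: a peak is taken to be a strict local maximum at or above a hypothetical `min_height`, and the sign convention for the free energy difference (mutant minus native) is assumed here, since conventions differ between stability prediction programs.

```python
def find_peaks(profile, min_height=0.0):
    """Indices of strict local maxima at or above min_height (assumed peak definition)."""
    return [
        i for i in range(1, len(profile) - 1)
        if profile[i] > profile[i - 1]
        and profile[i] > profile[i + 1]
        and profile[i] >= min_height
    ]


def folding_ddg(dg_mutant, dg_native):
    """Folding free energy difference, mutant minus native (assumed sign convention)."""
    return dg_mutant - dg_native


# Made-up contact counts per residue, for illustration only.
example_profile = [0, 2, 1, 3, 3, 1, 5, 0]
critical_positions = find_peaks(example_profile, min_height=2)
```

With folding free energies that are negative for stable proteins, a positive value from `folding_ddg` under this convention indicates a destabilizing mutation.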