Cell signaling pathways, which are a series of reactions that start at receptors and end at transcription factors, are basic to systems biology. Properly modeling the reactions in such pathways requires directed hypergraphs, where an edge is now directed between two sets of vertices. Inferring a pathway by the most parsimonious series of reactions then corresponds to finding a shortest hyperpath in a directed hypergraph, which is NP-complete. The state of the art for shortest hyperpaths in cell-signaling hypergraphs solves a mixed-integer linear program to find an optimal hyperpath that is restricted to be acyclic, and offers no efficiency guarantees.
We present for the first time a heuristic for general shortest hyperpaths that properly handles cycles, and is guaranteed to be efficient. Its accuracy is demonstrated through exhaustive experiments on all instances from the standard NCI-PID and Reactome pathway databases, which show the heuristic finds a hyperpath that matches the state-of-the-art mixed-integer linear program on over 99% of all instances that are acyclic. On instances where only cyclic hyperpaths exist, the heuristic surpasses the state-of-the-art, which finds no solution; on every such cyclic instance, enumerating all possible hyperpaths shows that the solution found by the heuristic is in fact optimal.
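To make the shortest-hyperpath setting concrete, below is a minimal sketch of the classic Dijkstra-style relaxation for directed hypergraphs (in the spirit of Knuth's grammar-problem algorithm), restricted to hyperedges with a single head vertex and an additive cost model. It only illustrates what a hyperpath distance is; it is not the heuristic introduced in the paper, whose handling of cycles and source sets is more involved.

```python
import heapq

def hyperpath_distance(vertices, hyperedges, sources, target):
    """Dijkstra-style relaxation over a directed hypergraph.
    hyperedges: list of (tail_set, head_vertex, weight) with nonempty
    tails and nonnegative weights; a hyperedge becomes usable once every
    tail vertex is settled. Vertices must be hashable and comparable
    (e.g., strings). Illustrative only -- not the paper's heuristic."""
    dist = {v: float("inf") for v in vertices}
    unreached = [len(tail) for tail, _, _ in hyperedges]
    incident = {v: [] for v in vertices}
    for i, (tail, _, _) in enumerate(hyperedges):
        for v in tail:
            incident[v].append(i)
    heap = []
    for s in sources:
        dist[s] = 0
        heapq.heappush(heap, (0, s))
    settled = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in settled:
            continue
        settled.add(v)
        if v == target:
            return d
        for i in incident[v]:
            unreached[i] -= 1
            if unreached[i] == 0:          # all tail vertices settled
                tail, head, w = hyperedges[i]
                cand = w + sum(dist[u] for u in tail)
                if cand < dist[head]:
                    dist[head] = cand
                    heapq.heappush(heap, (cand, head))
    return None                            # target not reachable
```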
We describe the Arizona-NOIRLab Temporal Analysis and Response to Events System (ANTARES), a software instrument designed to process large-scale streams of astronomical time-domain alerts. With the advent of large-format CCDs on wide-field imaging telescopes, time-domain surveys now routinely discover tens of thousands of new events each night, more than can be evaluated by astronomers alone. The ANTARES event broker will process alerts, annotating them with catalog associations and filtering them to distinguish customizable subsets of events. We describe the data model of the system, the overall architecture, annotation, implementation of filters, system outputs, provenance tracking, system performance, and the user interface.
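As a rough illustration of the filtering stage, here is a toy alert filter; the alert fields and thresholds are invented for the example and are not the actual ANTARES API.

```python
# A toy alert filter in the spirit of the broker's filtering stage; the
# alert fields and thresholds are hypothetical, not the ANTARES API.
def rare_transient_filter(alert):
    """Pass alerts that brightened quickly and lack a catalog match."""
    unmatched = not alert.get("catalog_associations")
    rise = alert.get("delta_magnitude", 0.0)    # change since last epoch
    return unmatched and rise <= -1.0           # brightened by over 1 mag

alerts = [{"catalog_associations": [], "delta_magnitude": -1.7},
          {"catalog_associations": ["NGC 1234"], "delta_magnitude": -2.0}]
print([rare_transient_filter(a) for a in alerts])   # [True, False]
```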
Protein secondary structure prediction is a fundamental task in computational biology, basic to many bioinformatics workflows, with a diverse collection of tools currently available. An approach from machine learning with the potential to capitalize on such a collection is ensemble prediction, which runs multiple predictors and combines their predictions into a single prediction output by the ensemble.
We conduct a thorough study of seven different approaches to ensemble secondary structure prediction, several of which are novel, and show we can indeed obtain an ensemble method that significantly exceeds the accuracy of individual state-of-the-art tools. The best approaches build on a recent technique known as feature-based accuracy estimation, which estimates the unknown true accuracy of a prediction, here using features of both the prediction output and the internal state of the prediction method. In particular, a hybrid approach to ensemble prediction that leverages accuracy estimation is now the most accurate method currently available: on average over standard CASP and PDB benchmarks, it exceeds the state-of-the-art Q3 accuracy for 3-state prediction by nearly 4%, and exceeds the Q8 accuracy for 8-state prediction by more than 8%.
A preliminary implementation of our approach to ensemble protein secondary structure prediction, in a new tool we call Ssylla, is available free for non-commercial use at ssylla.cs.arizona.edu.
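As a concrete illustration of one such ensemble scheme, here is a sketch that weights each component tool's prediction by its estimated accuracy and takes a weighted per-residue vote. The predictor and estimator callables are stand-ins for the paper's component tools and its feature-based estimator, and the paper's best hybrid approach is more involved than this.

```python
from collections import defaultdict

def ensemble_predict(sequence, predictors, estimate_accuracy):
    """Weighted-vote ensemble sketch. Each predictor maps a sequence to
    a state string of the same length (e.g., over H/E/C); the estimator
    maps (sequence, prediction) to an accuracy estimate in [0, 1]."""
    weighted = []
    for predict in predictors:
        prediction = predict(sequence)
        weighted.append((prediction, estimate_accuracy(sequence, prediction)))
    combined = []
    for i in range(len(sequence)):
        votes = defaultdict(float)
        for prediction, weight in weighted:
            votes[prediction[i]] += weight
        combined.append(max(votes, key=votes.get))
    return "".join(combined)
```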
Protein secondary structure prediction is a fundamental precursor to many bioinformatics tasks. Nearly all state-of-the-art tools, when computing their secondary structure prediction, do not explicitly leverage the vast number of proteins whose structure is known. Leveraging this additional information in a template-based method has the potential to significantly boost prediction accuracy.
We present a new hybrid approach to secondary structure prediction that gains the advantages of both template-based and template-free methods. Our core template-based method is an algorithmic approach that uses metric-space nearest neighbor search over a template database of fixed-length amino-acid words to determine estimated class-membership probabilities for each residue in the protein. These probabilities are then input to a dynamic programming algorithm that finds a physically-valid maximum-likelihood prediction for the entire protein. Our hybrid approach exploits a novel accuracy estimator for our core method, which estimates the unknown true accuracy of its prediction, to discern when to switch between template-based and template-free methods.
On challenging CASP benchmarks, the hybrid approach boosts the state-of-the-art Q8 accuracy by as much as 10%, and Q3 accuracy by as much as 3%, yielding the most accurate method currently available for both 3- and 8-state secondary structure prediction.
A preliminary implementation in a new tool we call Nnessy is available free for non-commercial use at http://nnessy.cs.arizona.edu.
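To illustrate the dynamic-programming step, here is a sketch that takes per-residue class-membership probabilities and finds a maximum-likelihood state string subject to minimum run lengths, one plausible reading of "physically valid"; the run-length values below and the method's actual validity constraints and recurrence may differ.

```python
import math

# Hypothetical minimum run lengths standing in for "physically valid";
# the paper's actual constraints may differ.
MIN_RUN = {"H": 3, "E": 2, "C": 1}

def max_likelihood_prediction(probs):
    """probs[i][s]: estimated probability that residue i is in state s.
    DP over prefixes ending with a completed run: best[i][s] is the best
    log-likelihood of a valid prediction of residues 0..i-1 whose last
    run is of state s. O(n^2 |states|^2) -- a sketch, not tuned."""
    n = len(probs)
    states = list(MIN_RUN)
    logp = [{s: math.log(max(p[s], 1e-300)) for s in states} for p in probs]
    NEG = float("-inf")
    best = [dict.fromkeys(states, NEG) for _ in range(n + 1)]
    back = [dict.fromkeys(states) for _ in range(n + 1)]
    for s in states:
        best[0][s] = 0.0
    for i in range(1, n + 1):
        for s in states:
            run = 0.0
            for L in range(1, i + 1):       # run of s over residues i-L..i-1
                run += logp[i - L][s]
                if L < MIN_RUN[s]:
                    continue
                for t in states:
                    if t == s and i > L:    # adjacent runs must differ
                        continue
                    cand = best[i - L][t] + run
                    if cand > best[i][s]:
                        best[i][s] = cand
                        back[i][s] = (i - L, t)
    i, s = n, max(states, key=lambda u: best[n][u])
    out = []
    while i > 0:
        j, t = back[i][s]
        out.append(s * (i - j))
        i, s = j, t
    return "".join(reversed(out))
```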
While mutation rates can vary markedly over the residues of a protein, multiple sequence alignment tools typically use the same values for their scoring-function parameters across a protein's entire length. We present a new approach, called adaptive local realignment, that in contrast automatically adapts to the diversity of mutation rates along protein sequences. This builds upon a recent technique known as parameter advising, which finds global parameter settings for an aligner, to now adaptively find local settings. Our approach in essence identifies local regions with low estimated accuracy, constructs a set of candidate realignments using a carefully-chosen collection of parameter settings, and replaces the region if a realignment has higher estimated accuracy. This new method of local parameter advising, when combined with prior methods for global advising, boosts alignment accuracy as much as 26% over the best default setting on hard-to-align protein benchmarks, and by 6.4% over global advising alone. Adaptive local realignment has been implemented within the Opal aligner using the Facet accuracy estimator.
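The loop structure of the approach is easy to sketch. In the toy version below, the window policy, the region realigner, and the accuracy estimator are all simplified stand-ins for the method's actual components (which identify low-accuracy regions rather than visiting fixed windows):

```python
def adaptive_local_realignment(alignment, realign, settings,
                               estimate_accuracy, window=50):
    """Sketch of the realignment loop on an alignment given as a list of
    equal-length row strings: visit column windows (right to left, so a
    spliced-in region of different width does not shift pending
    indices), realign each region under every candidate parameter
    setting, and splice in a candidate only when its estimated accuracy
    improves on the region's."""
    ncols = len(alignment[0])
    for start in reversed(range(0, ncols, window)):
        end = min(start + window, ncols)
        region = [row[start:end] for row in alignment]
        best, best_acc = region, estimate_accuracy(region)
        for params in settings:
            candidate = realign(region, params)
            acc = estimate_accuracy(candidate)
            if acc > best_acc:
                best, best_acc = candidate, acc
        alignment = [row[:start] + new_row + row[end:]
                     for row, new_row in zip(alignment, best)]
    return alignment
```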
The unprecedented volume and rate of transient events that will be discovered by the Large Synoptic Survey Telescope (LSST) demand that the astronomical community update its follow-up paradigm. Alert-brokers--automated software systems to sift through, characterize, annotate, and prioritize events for follow-up--will be critical tools for managing alert streams in the LSST era. The Arizona-NOAO Temporal Analysis and Response to Events System (ANTARES) is one such broker. In this work, we develop a machine learning pipeline to characterize and classify variable and transient sources only using the available multiband optical photometry. We describe three illustrative stages of the pipeline, serving the three goals of early, intermediate, and retrospective classification of alerts. The first takes the form of variable versus transient categorization, the second a multiclass typing of the combined variable and transient data set, and the third a purity-driven subtyping of a transient class. Although several similar algorithms have proven themselves in simulations, we validate their performance on real observations for the first time. We quantitatively evaluate our pipeline on sparse, unevenly sampled, heteroskedastic data from various existing observational campaigns, and demonstrate very competitive classification performance. We describe our progress toward adapting the pipeline developed in this work into a real-time broker working on live alert streams from time-domain surveys.
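The three stages compose naturally into a cascade. The sketch below assumes already-fitted scikit-learn-style classifiers and made-up class names; it shows only how an alert would be routed through early, intermediate, and retrospective stages, not the pipeline's actual features or models.

```python
def classify_alert(features, early, typing, subtyper, purity_cut=0.8):
    """Route one alert's photometric feature vector through the cascade.
    `early`, `typing`, and `subtyper` are fitted classifiers exposing
    scikit-learn-style predict/predict_proba; the class labels and the
    purity threshold here are hypothetical."""
    if early.predict([features])[0] == "variable":        # stage 1
        return "variable", None
    cls = typing.predict([features])[0]                   # stage 2
    if cls != "SN":                   # subtype only one transient class
        return cls, None
    confidence = max(subtyper.predict_proba([features])[0])
    subtype = (subtyper.predict([features])[0]
               if confidence >= purity_cut else None)
    return cls, subtype               # stage 3: purity-driven subtyping
```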
While multiple sequence alignment is essential to many biological analyses, its standard formulations are all NP-complete. Due to both its practical importance and computational difficulty, a plethora of heuristic multiple sequence aligners are in use in bioinformatics. Each of these tools has a multitude of parameters that must be set, and that greatly affect the quality of the output alignment. How to choose the best parameter setting for a user’s input sequences is a basic question, and most users simply rely on the aligner’s default setting, which may produce a low-quality alignment of their specific sequences.
In this monograph, we present a new general approach called parameter advising for finding a parameter setting that produces a high-quality alignment for a given set of input sequences. In this framework, a parameter advisor is a procedure that automatically chooses a parameter setting for the aligner, and has two main ingredients: (a) the set of parameter choices considered by the advisor, and (b) an estimator of alignment accuracy used to rank alignments produced by the aligner. On coupling a parameter advisor with an aligner, once the advisor is trained in a learning phase, the user simply inputs sequences to align and receives an output alignment from the aligner, where the advisor has automatically selected the parameter setting.
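In code, the advising loop itself is simple; all the substance lies in choosing the advisor set and learning the estimator. A minimal sketch, with the aligner and estimator as stand-in callables:

```python
def advise(sequences, aligner, choices, estimator):
    """The advising procedure in its simplest form: align the input
    under each parameter choice in the advisor set, estimate each
    alternative's accuracy, and return the alignment ranked highest.
    `aligner` and `estimator` stand in for a real aligner and a learned
    accuracy estimator such as those studied in the book."""
    best, best_est = None, float("-inf")
    for params in choices:
        alignment = aligner(sequences, params)
        est = estimator(alignment)
        if est > best_est:
            best, best_est = alignment, est
    return best
```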
The book is organized in two parts: the first lays out the foundations of parameter advising, and the second provides applications and extensions of advising. The content examines formulations of parameter advising and their computational complexity, develops methods for learning good accuracy estimators, presents approximation algorithms for finding good sets of parameter choices, and assesses software implementations of advising that perform well on real biological data. Also explored are applications of parameter advising to adaptive local realignment, where advising is performed on local regions of the sequences to automatically adapt to varying mutation rates; and ensemble alignment, where advising is applied to an ensemble of aligners to effectively yield a new aligner of higher quality than the individual aligners in the ensemble. Finally, future directions in advising research are offered.
This work arose from a series of joint research papers by the coauthors that initiated and developed the theory and practice of parameter advising, and that formed the basis of the first author’s doctoral dissertation.
Parameter advising is a general technique, with the potential to be of broad utility beyond sequence alignment. We hope this monograph encourages others to explore this fruitful area of investigation.
In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted.
We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment’s accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner’s scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
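A rough sketch of the prediction step, assuming a learned distance function and regression function are given (the paper learns the distance by large-scale linear programming; the value of k and the weighting scheme below are illustrative choices):

```python
import math

def predict_coreness(column_features, examples, dist, regress, k=5):
    """Nearest-neighbor regression sketch: find the k nearest labeled
    example columns under a learned distance `dist`, form a
    distance-weighted average of their coreness labels, and map it
    through a learned regression function `regress`.
    examples: list of (features, coreness) pairs."""
    nearest = sorted(examples, key=lambda ex: dist(column_features, ex[0]))[:k]
    weights = [math.exp(-dist(column_features, f)) for f, _ in nearest]
    avg = (sum(w * c for w, (_, c) in zip(weights, nearest))
           / sum(weights))
    return regress(avg)
```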
While mutation rates can vary markedly over the residues of a protein, multiple sequence alignment tools typically use the same values for their scoring-function parameters across a protein’s entire length. We present a new approach, called adaptive local realignment, that in contrast automatically adapts to the diversity of mutation rates along protein sequences. This builds upon a recent technique known as parameter advising that finds global parameter settings for aligners, to adaptively find local settings. Our approach in essence identifies local regions with low estimated accuracy, constructs a set of candidate realignments using a carefully-chosen collection of parameter settings, and replaces the region if a realignment has higher estimated accuracy. This new method of local parameter advising, when combined with prior methods for global advising, boosts alignment accuracy as much as 26% over the best default setting on hard-to-align protein benchmarks, and by 6.4% over global advising alone. Adaptive local realignment, implemented within the Opal aligner using the Facet accuracy estimator, is available at facet.cs.arizona.edu.
While the multiple sequence alignment output by an aligner strongly depends on the parameter values used for the alignment scoring function (such as the choice of gap penalties and substitution scores), most users rely on the single default parameter setting provided by the aligner. A different parameter setting, however, might yield a much higher-quality alignment for the specific set of input sequences. The problem of picking a good choice of parameter values for specific input sequences is called parameter advising. A parameter advisor has two ingredients: (i) a set of parameter choices to select from, and (ii) an estimator that provides an estimate of the accuracy of the alignment computed by the aligner using a parameter choice. The parameter advisor picks the parameter choice from the set whose resulting alignment has highest estimated accuracy.
We consider for the first time the problem of learning the optimal set of parameter choices for a parameter advisor that uses a given accuracy estimator. The optimal set is one that maximizes the expected true accuracy of the resulting parameter advisor, averaged over a collection of training data. While we prove that learning an optimal set for an advisor is NP-complete, we show there is a natural approximation algorithm for this problem, and prove a tight bound on its approximation ratio. Experiments with an implementation of this approximation algorithm on biological benchmarks, using various accuracy estimators from the literature, show it finds sets for advisors that are surprisingly close to optimal. Furthermore, the resulting parameter advisors are significantly more accurate in practice than simply aligning with a single default parameter choice.
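The flavor of a greedy approximation for this set-selection problem is easy to convey; the paper's algorithm and the analysis behind its tight approximation ratio are more refined than this sketch, and `advising_accuracy` stands in for average true advising accuracy over training benchmarks.

```python
def greedy_advisor_set(universe, k, advising_accuracy):
    """Greedily grow an advisor set of cardinality k: at each step add
    the parameter choice that most improves the advisor's average true
    accuracy on training data. A sketch of the idea only."""
    chosen = []
    while len(chosen) < k:
        best, best_acc = None, float("-inf")
        for params in universe:
            if params in chosen:
                continue
            acc = advising_accuracy(chosen + [params])
            if acc > best_acc:
                best, best_acc = params, acc
        chosen.append(best)
    return chosen
```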
Measuring execution time is one of the most widely used performance evaluation techniques in computer science research. Inaccurate measurements cannot support a fair performance comparison between programs. Despite the prevalence of such measurements, the intrinsic variability of time measurement makes it hard to obtain repeatable and accurate timing results for a program running on an operating system. We propose a novel execution time measurement protocol (termed EMP) for measuring the execution time of a compute-bound program on Linux while minimizing the measurement's variability. During the development of EMP, we identified several factors that disturb execution time measurement. We introduce successive refinements to the protocol that address each of these factors, together reducing variability by more than an order of magnitude. We also introduce a new visualization technique, which we term the "dual-execution scatter plot," that highlights infrequent, long-running daemons, differentiating them from frequent and/or short-running daemons. Our empirical results show that the proposed protocol achieves three major goals--precision, accuracy, and scalability--in execution time measurement, and works for both open-source and proprietary software.
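For a sense of the baseline such a protocol improves upon, here is a minimal timing harness of the usual kind: pin the process to a core, repeat the run, and report the median and spread. EMP itself controls many more disturbance factors (daemons, interrupts, frequency scaling, and so on) than this sketch does.

```python
import os, statistics, subprocess, time

def measure(cmd, runs=30, cpu=2):
    """Repeatedly time a command on one pinned core (Linux).
    Returns (median, spread) of wall-clock samples in seconds."""
    os.sched_setaffinity(0, {cpu})      # children inherit the affinity
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), max(samples) - min(samples)

print(measure(["/bin/true"]))
```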
In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the folded three-dimensional structures of the proteins. When computing a protein multiple sequence alignment in practice, a reference alignment is not known, so the coreness of the columns of a computed alignment can only be predicted.
We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment’s accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner’s scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
The multiple sequence alignments computed by an aligner for different settings of its parameters, as well as the alignments computed by different aligners using their default settings, can differ markedly in accuracy. Parameter advising is the task of choosing a parameter setting for an aligner to maximize the accuracy of the resulting alignment. We extend parameter advising to aligner advising, which in contrast chooses among a set of aligners to maximize accuracy. In the context of aligner advising, default advising selects from a set of aligners that are using their default settings, while general advising selects both the aligner and its parameter setting.
In this paper, we apply aligner advising for the first time, to create a true ensemble aligner. Through cross-validation experiments on benchmark protein sequence alignments, we show that parameter advising boosts an aligner's accuracy beyond its default setting for virtually all of the standard aligners currently used in practice. Furthermore, aligner advising with a collection of aligners further improves upon parameter advising with any single aligner, though surprisingly the performance of default advising on testing data is actually superior to general advising due to less overfitting to training data.
The new ensemble aligner that results from aligner advising is significantly more accurate than the best single default aligner, especially on hard-to-align sequences. This successfully demonstrates how to construct, out of a collection of individual aligners, a more accurate ensemble aligner.
While the multiple sequence alignment output by an aligner strongly depends on the parameter values used for the alignment scoring function (such as the choice of gap penalties and substitution scores), most users rely on the single default parameter setting provided by the aligner. A different parameter setting, however, might yield a much higher-quality alignment for the specific set of input sequences. The problem of picking a good choice of parameter values for specific input sequences is called parameter advising. A parameter advisor has two ingredients: (i) a set of parameter choices to select from, and (ii) an estimator that provides an estimate of the accuracy of the alignment computed by the aligner using a parameter choice. The parameter advisor picks the parameter choice from the set whose resulting alignment has highest estimated accuracy.
We consider for the first time the problem of learning the optimal set of parameter choices for a parameter advisor that uses a given accuracy estimator. The optimal set is one that maximizes the expected true accuracy of the resulting parameter advisor, averaged over a collection of training data. While we prove that learning an optimal set for an advisor is NP-complete, we show there is a natural approximation algorithm for this problem, and prove a tight bound on its approximation ratio. Experiments with an implementation of this approximation algorithm on biological benchmarks, using various accuracy estimators from the literature, show it finds sets for advisors that are surprisingly close to optimal. Furthermore, the resulting parameter advisors are significantly more accurate in practice than simply aligning with a single default parameter choice.
We develop a novel and general approach to estimating the accuracy of protein multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new task that we call parameter advising: the problem of choosing values for alignment scoring function parameters from a given set of choices to maximize the accuracy of a computed alignment.
For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. Compared to prior approaches for estimating accuracy, our new approach (a) introduces novel feature functions that measure nonlocal properties of an alignment yet are fast to evaluate, (b) considers more general classes of estimators beyond linear combinations of features, and (c) develops new regression formulations for learning an estimator from examples; in addition, we (d) determine the optimal parameter set of a given cardinality, which specifies the best parameter values from which to choose.
Our estimator, which we call Facet (for "feature-based accuracy estimator"), yields a parameter advisor that on the hardest benchmarks provides more than a 27% improvement in accuracy over the best default parameter choice, and for parameter advising significantly outperforms the best prior approaches to assessing alignment quality.
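As a simplified illustration of learning such an estimator, the sketch below fits a linear combination of feature functions by least squares on benchmark examples with known true accuracy; Facet's actual features, estimator classes, and regression formulations are richer than this.

```python
import numpy as np

def fit_linear_estimator(examples, features):
    """examples: (alignment, true_accuracy) pairs from benchmarks with
    reference alignments; features: feature functions of an alignment.
    Returns an estimator mapping an alignment to estimated accuracy.
    Least squares stands in for the paper's optimization formulations."""
    X = np.array([[f(aln) for f in features] for aln, _ in examples])
    y = np.array([acc for _, acc in examples])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda aln: float(np.dot(coeffs, [f(aln) for f in features]))
```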
We develop a novel and general approach to estimating the accuracy of protein multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new problem that we call parameter advising. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. We evaluate this approach by applying it to the task of parameter advising: the problem of choosing alignment scoring parameters from a collection of parameter values to maximize the accuracy of a computed alignment. Our estimator, which we call Facet (for "feature-based accuracy estimator"), yields a parameter advisor that on the hardest benchmarks provides more than a 20% improvement in accuracy over the best default parameter choice, and outperforms the best prior approaches to selecting good alignments for parameter advising.
Automatic design of synthetic gene circuits poses a significant challenge to synthetic biology, primarily due to the complexity of biological systems and the lack of rigorous optimization methods that can cope with the combinatorial explosion as the number of biological parts increases. Current optimization methods for synthetic gene design rely on heuristic algorithms that are usually not deterministic, deliver sub-optimal solutions, and provide no guarantees on convergence or error bounds. Here, we introduce an optimization framework for the problem of part selection in synthetic gene circuits that is based on mixed integer non-linear programming (MINLP), which is a deterministic method that finds the globally optimal solution and guarantees convergence in finite time. Given a synthetic gene circuit, a library of characterized parts, and user-defined constraints, our method can find the optimal selection of parts that satisfies the constraints and best approximates the objective function given by the user. We evaluated the proposed method in the design of three synthetic circuits (a toggle switch, a transcriptional cascade, and a band detector), with both experimentally constructed and synthetic promoter libraries. Scalability and robustness analysis shows that the proposed framework scales well with the library size and the solution space. The work described here is a step towards a unifying, realistic framework for the automated design of biological circuits.
Synthetic biology aspires to revolutionize the way we construct biological circuits, as it promises fast time-to-market synthetic systems through part standardization, model abstraction, and design and process automation. However, the automated design of synthetic circuits remains an unsolved problem, despite the increasing number of practitioners in the field. One reason behind this is the absence of an efficient mathematical formulation for the combinatorial optimization problem of selecting genes and promoters when synthesizing the candidate circuits. Here, we propose an optimization framework that is based on a linear relaxation of the non-linear optimization problem, which proves to be a good approximation of the non-linear dynamics present in biological systems. Further evaluation of the proposed framework on a real non-linear synthetic circuit (a toggle switch), and with the use of a mutant promoter library, resulted in rapid and reproducible convergence to a synthetic circuit that exhibits the desired characteristics and temporal expression profiles. This work is a step towards a unifying, realistic framework for the automated construction of biological circuits with desired temporal profiles and user-defined constraints.
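To convey the shape of the combinatorial problem, here is a toy linearized part-selection model in PuLP: pick exactly one part per slot so a summed predicted activity lands near a target level. Real circuit dynamics are non-linear, and the papers above work with an MINLP and a linear relaxation of it, so treat this only as an outline of the decision variables and constraints, with all names hypothetical.

```python
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

def select_parts(slots, target):
    """slots: dict slot_name -> {part_name: predicted_activity}, with
    string keys. Minimizes |total activity - target| subject to picking
    exactly one part per slot. A toy stand-in for the real models."""
    prob = LpProblem("part_selection", LpMinimize)
    x = {(s, p): LpVariable(f"x_{s}_{p}", cat=LpBinary)
         for s in slots for p in slots[s]}
    dev = LpVariable("deviation", lowBound=0)
    activity = lpSum(slots[s][p] * x[s, p] for s in slots for p in slots[s])
    prob += dev                                  # objective
    prob += activity - target <= dev
    prob += target - activity <= dev
    for s in slots:
        prob += lpSum(x[s, p] for p in slots[s]) == 1
    prob.solve()
    return {s: next(p for p in slots[s] if x[s, p].value() > 0.5)
            for s in slots}
```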
Accurately aligning distant protein sequences is notoriously difficult. Since the amino acid sequence alone often does not provide enough information to obtain accurate alignments under the standard alignment scoring functions, a recent approach to improving alignment accuracy is to use additional information such as secondary structure. We make several advances in alignment of protein sequences annotated with predicted secondary structure: (1) more accurate models for scoring alignments, (2) efficient algorithms for optimal alignment under these models, and (3) improved learning criteria for setting model parameters through inverse alignment, as well as (4) in-depth experiments evaluating model variants on benchmark alignments. More specifically, the new models use secondary structure predictions and their confidences to modify the scoring of both substitutions and gaps. All models have efficient algorithms for optimal pairwise alignment that run in near-quadratic time. These models have many parameters, which are rigorously learned using inverse alignment under a new criterion that carefully balances score error and recovery error.
We then evaluate these models by studying how accurately an optimal alignment under the model recovers benchmark reference alignments that are based on the known three-dimensional structures of the proteins. The experiments show that these new models provide a significant boost in accuracy over the standard model for distant sequences. The improvement for pairwise alignment is as much as 15% for sequences with less than 25% identity, while for multiple alignment the improvement is more than 20% for difficult benchmarks whose accuracy under standard tools is at most 40%.
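The core idea of structure-modified scoring can be shown in a few lines: reward a substitution whose residues share a predicted secondary-structure state, scaled by prediction confidence. The models in the paper modify both substitution and gap scores in more refined ways; the bonus form below is illustrative.

```python
def structure_modified_score(a, b, ss_a, ss_b, conf_a, conf_b,
                             sub_score, bonus=1.0):
    """Similarity score for aligning residue a (predicted state ss_a,
    confidence conf_a) with residue b. Higher is better; `sub_score`
    is an ordinary substitution matrix and `bonus` a learned weight."""
    score = sub_score[a, b]
    if ss_a == ss_b:
        score += bonus * min(conf_a, conf_b)
    return score
```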
Accurately aligning distant protein sequences is notoriously difficult. A recent approach to improving alignment accuracy is to use additional information such as predicted secondary structure. We introduce several new models for scoring alignments of protein sequences with predicted secondary structure, which use the predictions and their confidences to modify both the substitution and gap cost functions. We present efficient algorithms for computing optimal pairwise alignments under these models, all of which run in near-quadratic time. We also review an approach to learning the values of the parameters in these models called inverse alignment. We then evaluate the accuracy of these models by studying how well an optimal alignment under the model recovers known benchmark reference alignments. Our experiments show that using parameters learned by inverse alignment, these new secondary-structure-based models provide a significant improvement in alignment accuracy for distant sequences. The best model improves upon the accuracy of the standard sequence alignment model for pairwise alignment by as much as 15% for sequences with less than 25% identity, and improves the accuracy of multiple alignment by 20% for difficult benchmarks whose average accuracy under standard tools is less than 40%.
When aligning biological sequences, the choice of scoring scheme is critical. Even small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to learn parameter values that are appropriate for biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct reference alignments, this is the problem of finding parameter values that make the scores of the reference alignments be as close as possible to those of optimal alignments of their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the reference alignment is not specified, and to an improved formulation based on minimizing the average error between the scores of the reference alignments and the scores of optimal alignments. Experiments on benchmark biological alignments show we can learn scoring schemes that generalize across protein families, and that boost the accuracy of multiple sequence alignment by as much as 25 percent.
When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for biological sequences is inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the example alignments score close to optimal. We extend prior work on inverse alignment to partial examples and to an improved model based on minimizing the average error of the examples. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the recovery rate for multiple sequence alignment by up to 25%.
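Under a linear scoring model, where an alignment A of example i has cost w · f_i(A) for a feature-count vector f_i (substitution counts, gap opens, gap extensions, and so on), one standard way to write the minimum-average-error formulation is the linear program below. The infinitely many constraints are handled in practice by cutting planes that repeatedly add the currently optimal alignment, and a normalization constraint on w is needed to preclude the trivial all-zero solution; details vary, so take this as the general shape rather than the papers' exact program.

```latex
\min_{w \ge 0,\; e \ge 0} \quad \frac{1}{k} \sum_{i=1}^{k} e_i
\qquad \text{subject to} \qquad
w \cdot f_i(R_i) \;\le\; w \cdot f_i(A) + e_i
\quad \text{for every alignment } A \text{ of example } i, \; 1 \le i \le k,
```

where R_i is the i-th reference alignment and smaller cost is better.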
Multiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general this form-and-polish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment.
We study six stages in the form-and-polish strategy for multiple alignment: parameter choice, distance estimation, merge-tree construction, sequence-pair weighting, alignment merging, and polishing. For each stage, we consider novel approaches as well as standard ones. Interestingly, the greatest gains in alignment quality come from (i) estimating distances by a new approach using normalized alignment costs, and (ii) polishing by a new approach using 3-cuts. Experiments with a parameter-value oracle suggest large gains in quality may be possible through an input-dependent choice of alignment parameters, and we present a promising approach for building such an oracle. Combining the best approaches to each stage yields a new tool we call Opal that on benchmark alignments matches the quality of the top tools, without employing alignment consistency or hydrophobic gap penalties.
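Of the stages, polishing is the easiest to sketch. Below is the flavor of polishing with three-way cuts: repeatedly split the current alignment into three subalignments and remerge them with an aligning-alignments routine, keeping improvements. The cut policy, merge routine, and cost function are stand-ins; the tool's actual polisher chooses its cuts more carefully than at random.

```python
import random

def polish_3cut(alignment, split3, merge, cost, rounds=100, seed=0):
    """Flavor of 3-cut polishing: split the alignment into three induced
    subalignments, remerge with an aligning-alignments solver `merge`,
    and keep the result when its cost improves."""
    rng = random.Random(seed)
    best = alignment
    for _ in range(rounds):
        a, b, c = split3(best, rng)        # three induced subalignments
        candidate = merge(merge(a, b), c)
        if cost(candidate) < cost(best):
            best = candidate
    return best
```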
For as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. While some choices for substitution scores are now common, largely due to convention, there is no standard for choosing gap penalties. An objective way to resolve this question is to learn the appropriate values by solving the Inverse String Alignment Problem: given examples of correct alignments, find parameter values that make the examples be optimal-scoring alignments of their strings.
We present a new polynomial-time algorithm for Inverse String Alignment that is simple to implement, fast in practice, and for the first time can learn hundreds of parameters simultaneously. The approach is also flexible: minor modifications allow us to solve inverse unique alignment (find parameter values that make the examples be the unique optimal alignments of their strings), and inverse near-optimal alignment (find parameter values that make the example alignments be as close to optimal as possible). Computational results with an implementation for global alignment show that, for the first time, we can find best-possible values for all 212 parameters of the standard protein-sequence scoring-model from hundreds of alignments in a few minutes of computation.
Software watermarks -- bitstrings encoding some sort of identifying information that are embedded into executable programs -- are an important tool against software piracy. Most existing proposals for software watermarking have the shortcoming that they can be destroyed via fairly straightforward semantics-preserving code transformations. This paper introduces path-based watermarking, a new approach to software watermarking based on the dynamic branching behavior of programs. We show how error-correcting and tamper-proofing techniques can be used to make path-based watermarks resilient against a wide variety of attacks. Experimental results, using both Java bytecode and IA-32 native code, indicate that even relatively large watermarks can be embedded into programs at modest cost.
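A toy may help convey the idea: on a secret recognizer input, a marked branch's taken/not-taken pattern spells out the watermark in the program's dynamic branch trace, while on ordinary inputs the branch follows the data. Everything here is invented for illustration; the paper embeds watermarks in Java bytecode and IA-32 branch structure and adds error correction and tamper-proofing.

```python
WATERMARK = "1011001"        # hypothetical embedded bitstring
SECRET = 0xC0FFEE            # hypothetical recognizer input

def marked(x):
    """On input SECRET, iteration i takes the 'if' branch exactly when
    WATERMARK[i] == '1', so a branch tracer recovers the bits; on other
    inputs the branch depends on the data as usual."""
    total = 0
    for i in range(len(WATERMARK)):
        if (x == SECRET and WATERMARK[i] == "1") or \
           (x != SECRET and (x >> i) & 1):
            total += i       # "taken" side does innocuous work
        else:
            total -= i
    return total
```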
A basic computational problem that arises in both the construction and local-search phases of the best heuristics for multiple sequence alignment is that of aligning the columns of two multiple alignments. When the scoring function is the sum-of-pairs objective and induced pairwise alignments are evaluated using linear gap-costs, we call this problem Aligning Alignments. While seemingly a straightforward extension of two-sequence alignment, we prove it is actually NP-complete. As explained in the paper, this provides the first demonstration that minimizing linear gap-costs, in the context of multiple sequence alignment, is inherently hard.
We also develop an exact algorithm for Aligning Alignments that is remarkably efficient in practice, both in time and space. Even though the problem is NP-complete, computational experiments on both biological and simulated data show we can compute optimal alignments for all benchmark instances in two standard datasets, and solve very-large random instances with highly-gapped sequences.
"A great deal of software is distributed in the form of executable code. The ability to reverse engineer such executables can create opportunities for theft of intellectual property via software piracy, as well as security breaches by allowing attackers to discover vulnerabilities in an application. Techniques such as watermarking and fingerprinting have been developed to discourage piracy, however, if no protective measures are taken, an attacker may be able to remove and/or destroy watermarks and fingerprints with relative ease once they have been identified. For this reason, methods such as source code obfuscation, code encryption, and self verifying code have been developed to help achieve some measure of tamper-resistance.
"It is, of course, necessary for an attacker to gain a reliable disassembly of some portion of executable code before any intelligent tampering can take place. In fact, even a reliable disassembly in the absence of some sort of control flow graph is not sufficient for serious tampering. Coupled with other methods, we propose one method of obfuscating address computations in which the targets of control transfers are made difficult to determine statically ...."
One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting without prior knowledge of consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the so-called k-median problem in combinatorial optimization, which we solve using integer linear-programming. Experiments with an implementation show we can separate fragments that are overlaid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.
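For reference, the core subproblem's textbook formulation: with d_{ij} a measure of disagreement between overlaid fragments i and j, the k-median integer linear program chooses k fragments as medians (y_j = 1) and assigns each fragment to a chosen median (x_{ij} = 1). The paper's actual formulation may differ in its details.

```latex
\min \;\; \sum_{i,j} d_{ij}\, x_{ij}
\qquad \text{subject to} \qquad
\sum_{j} x_{ij} = 1 \;\; \forall i, \qquad
x_{ij} \le y_j \;\; \forall i, j, \qquad
\sum_{j} y_j = k, \qquad
x_{ij},\, y_j \in \{0, 1\}.
```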
We present a new method for reconstructing the distances between probes in physical maps of chromosomes constructed by hybridizing pairs of clones under the so-called sampling-without-replacement protocol. In this protocol, which is simple, inexpensive, and has been used to successfully map several organisms, equal-length clones are hybridized against a clone-subset called the probes. The probes are chosen by a sequential process that is designed to generate a pairwise-nonoverlapping subset of the clones. We derive a likelihood function on probe spacings and orders for this protocol under a natural model of hybridization error, and describe how to reconstruct the most likely spacing for a given order under this objective using continuous optimization. The approach is tested on simulated data and real data from chromosome VI of Aspergillus nidulans. On simulated data we recover the true order and close to the true spacing; on the real data, for which the true order and spacing are unknown, we recover a probe order differing significantly from the published one. To our knowledge this is the first practical approach for computing a globally-optimal maximum-likelihood reconstruction of interprobe distances from clone-probe hybridization data.
We study two new problems in sequence alignment both from a practical and a theoretical view, using tools from combinatorial optimization to develop branch-and-cut algorithms. The Generalized Maximum Trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them the original formulation of Maximum Trace. The RNA Sequence Alignment Problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. Both problems have a characterization in terms of graphs which we reformulate in terms of integer linear programming. We then study the polytopes (or convex hulls of all feasible solutions) associated with the integer linear program for both problems. For each polytope we derive several classes of facet-defining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. This leads to a polynomial time algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branch-and-cut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches.
We introduce a new combinatorial formulation of chromosome physical-mapping by the sampling-without-replacement protocol. In this protocol, which is simple, inexpensive, and has been used to successfully map several organisms, equal-length clones are hybridized against a subset of the clones called probes, which are designed to form a maximal nonoverlapping clone-subset. The output of the protocol is the clone-probe hybridization matrix H. The problem of finding a maximum-likelihood reconstruction of the order of the probes along the chromosome in the presence of false positive and negative hybridization error is equivalent to finding the minimum number of entries of H to change to zeros so that the resulting matrix has at most 2 ones per row, and the consecutive-ones property across rows. This combinatorial problem, which we call 2-Consecutive-Ones Mapping, has a concise integer linear-programming formulation, to which we apply techniques from polyhedral combinatorics. The formulation is unique in that it does not explicitly represent the probe permutation, and in contrast to prior linear-programming approaches, the number of variables is small: in practice, linear in the number of clones. We derive a large class of facet-defining inequalities for the 2-consecutive-ones polytope that we call the augmented k-degree inequalities, and we show that the basic k-degree class can be efficiently separated using bipartite matchings. Computational results with an implementation of the resulting branch-and-cut algorithm applied to both simulated and real data from the complete genome of Aspergillus nidulans show that we can solve many problems to provable optimality and find maps of higher quality than previously possible.
We give an experimental study of a new O(mn α(m,n))-time implementation of Edmonds' algorithm for maximum-cardinality matching in sparse general graphs with n vertices and m edges. The implementation incorporates several optimizations resulting from searching for augmenting paths depth-first, and we study the interaction among a set of heuristics, each with the potential to significantly speed up the code, through experiments with all possible variants. The experiments indicate that the simplest heuristic, an early-termination test for the depth-first search, results in the greatest performance gain, and yields an implementation that on graphs with large degree actually finds an optimal solution in less time than a standard greedy heuristic. The resulting code appears to be the fastest among those publicly available on the classes of random, k-regular, union-of-k-cycle, and Euclidean k-nearest-neighbor graphs with up to 100,000 vertices, 500,000 edges, and average degree up to 10, achieving a maximum speedup of 50 over the LEDA codes, and 4 and 350 over two of the DIMACS implementation challenge codes, while never taking longer than these implementations.
While the area of sequence comparison has a rich collection of results on the alignment of two sequences, and even the alignment of multiple sequences, there is little known about the alignment of two alignments. The problem becomes interesting when the alignment objective function counts gaps, as is common when aligning biological sequences, and has the form of the sum-of-pairs objective. We begin a thorough investigation of aligning two alignments under the sum-of-pairs objective with general linear gap costs when either of the two alignments are given in the form of a sequence (a degenerate alignment containing a single sequence), a multiple alignment (containing two or more sequences), or a profile (a representation of a multiple alignment often used in computational biology). This leads to five problem variations, some of which arise in widely-used heuristics for multiple sequence alignment, and in assessing the relatedness of a sequence to a sequence family. For variations in which exact gap counts are computationally difficult to determine, we offer a framework in terms of optimistic and pessimistic gap counts. For optimistic and pessimistic gap counts we give efficient algorithms for the sequence vs. alignment, sequence vs. profile, alignment vs. alignment, and profile vs. profile variations, all of which run in essentially O(mn) time for two input alignments of lengths m and n. For exact gap counts, we give the first provably efficient algorithm for the sequence vs. alignment variation, which runs in essentially O(mn log n) time using the candidate-list technique developed for convex gap-costs, and we conjecture that the alignment vs. alignment variation is NP-complete.
We consider the problem of multiple sequence alignment under a fixed evolutionary tree: given a tree whose leaves are labeled by sequences, find ancestral sequences to label its internal nodes so as to minimize the total length of the tree, where the length of an edge is the edit distance between the sequences labeling its endpoints. We present a new polynomial-time approximation algorithm for this problem, and analyze its performance on regular d-ary trees with d a constant. On such a tree, the algorithm finds a solution within a factor (d+1)/(d-1) of the minimum in O(k d T(d,n) + k^2 d n^2) time, where k is the number of leaves in the tree, n is the length of the longest sequence labeling a leaf, and T(d,n) is the time to compute a Steiner point for d sequences of length at most n. (A Steiner point for a set S of sequences is a sequence P that minimizes the sum of the edit distances from P to each sequence in S. The time T(d,n) is O(d 2^d n^d), given O(d s^(d+1))-time preprocessing for an alphabet of size s.) The approximation algorithm is conceptually simple and easy to implement, and actually applies to any metric space in which a Steiner point for any fixed-sized set can be computed in polynomial time.
We also introduce a new problem, bottleneck tree-alignment, in which the objective is to label the internal nodes of the tree so as to minimize the length of the longest edge. We describe an exponential-time exact algorithm for the case of unit-cost edit operations, and show there is a simple linear-time approximation algorithm for the general case that finds a solution within a factor O(log k) of the minimum.
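For contrast with the Steiner-point algorithm above, the classic "lifted" labeling scheme, where each internal node takes the sequence of some leaf beneath it, has a simple bottom-up dynamic program; it is a different (and simpler) method than the paper's, shown only to make tree labeling concrete, and the tree-node interface is hypothetical.

```python
def lifted_cost(v, seq, dist, memo=None):
    """Bottom-up DP over lifted labelings. Returns a table mapping each
    candidate leaf label of node v to the minimum total edge length of
    v's subtree; seq maps leaf names to sequences, dist is edit
    distance, and nodes expose is_leaf(), name, and children."""
    memo = {} if memo is None else memo
    if v.is_leaf():
        return {v.name: 0.0}
    if id(v) in memo:
        return memo[id(v)]
    tables = [lifted_cost(c, seq, dist, memo) for c in v.children]
    out = {}
    for i, ti in enumerate(tables):
        for leaf, c_same in ti.items():
            total = c_same          # edge to child i is free (same label)
            for j, tj in enumerate(tables):
                if j != i:
                    total += min(c + dist(seq[leaf], seq[u])
                                 for u, c in tj.items())
            out[leaf] = min(total, out.get(leaf, float("inf")))
    memo[id(v)] = out
    return out
```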
One of the classic problems in computational biology is the reconstruction of evolutionary history. A recent trend in the area is to increase the explanatory power of the models that are considered by incorporating higher-order evolutionary events that more accurately reflect the mechanisms of mutation at the level of the chromosome.
We take a step in this direction by considering the problem of reconstructing an evolutionary history for a set of genetic sequences that have evolved by recombination. Recombination is a non-tree-like event that produces a child sequence by crossing two parent sequences. We present polynomial-time algorithms for reconstructing a parsimonious history of such events for several models of recombination when all sequences, including those of ancestors, are present in the input. We also show that these models appear to be near the limit of what can be solved in polynomial time, in that several natural generalizations are NP-complete.
A major challenge in computational biology is the identification of evolutionarily related macromolecules in cases of distant homology, where pairwise sequence similarity may be low, or even insignificant. Methods for the quantitative assessment of such distant relationships must go beyond simple pairwise sequence comparison, and exploit all the information available in multiple alignments of related sequences. Evaluation of the statistical significance of a pairwise alignment score by random shuffling is an established approach, which we extend here to the more general case of evaluating the similarity between a query sequence and a prealigned family of macromolecules whose multiple alignment may contain gaps. The method involves (1) the optimal alignment of the query sequence against the prealigned family, for which we develop a new algorithm that accurately takes into account the structure of gaps in the alignment, followed by (2) repeated shuffling of the query sequence and calculation of a shuffling statistic to assess the significance of the query-family homology, for which we consider four different measures. We apply the method to several data sets of interest, and compare its sensitivity to the traditional method of pairwise sequence comparison. Overall, the significance score for the query-family homology assessed by the new method exceeded the median significance score assessed by the traditional pairwise method by 2.2 standard deviations on average. This suggests that the method is consistently more sensitive for assessing distant homology than the conventional approach based on pairwise sequence comparison.
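The overall procedure is easy to sketch: score the query against the family, rescore many residue-shuffled copies of the query, and convert to a z-score. The alignment routine below stands in for the paper's gap-aware query-to-family algorithm, and the z-score is just one of the shuffling statistics the paper considers.

```python
import random

def homology_zscore(query, family, align_score, shuffles=1000, seed=0):
    """Shuffling significance test: z-score of the true query-family
    score against the score distribution over shuffled queries."""
    rng = random.Random(seed)
    true_score = align_score(query, family)
    letters = list(query)
    scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        scores.append(align_score("".join(letters), family))
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)) ** 0.5
    return (true_score - mean) / sd
```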
"To the Editor: A progressive decline in plasma or serum selenium levels paralleling the loss of CD4+ T cells has been widely documented in HIV-1 infections .... In light of this body of clinical evidence, we would like to report an observation of potential significance for the molecular mechanisms underlying any interaction between HIV and dietary selenium ...."
A fundamental problem in computational biology is the construction of physical maps of chromosomes from hybridization experiments between unique probes and clones of chromosome fragments in the presence of error. Alizadeh, Karp, Weisser and Zweig (Algorithmica 13:1/2, 52-76, 1995) first considered a maximum-likelihood model of the problem that is equivalent to finding an ordering of the probes that minimizes a weighted sum of errors, and developed several effective heuristics. We show that by exploiting information about the end-probes of clones, this model can be formulated as a weighted Betweenness Problem. This affords the significant advantage of allowing the well-developed tools of integer linear-programming and branch-and-cut algorithms to be brought to bear on physical mapping, enabling us for the first time to solve small mapping instances to optimality even in the presence of high error. We also show that by combining the optimal solution of many small overlapping Betweenness Problems, one can effectively screen errors from larger instances, and solve the edited instance to optimality as a Hamming-Distance Traveling Salesman Problem. This suggests a new approach, a Betweenness-Traveling Salesman hybrid, for constructing physical maps.
We show how to efficiently reconstruct an original DNA sequence of length n with high probability from o(log^2 n) copies containing insertion, deletion, and substitution errors, assuming the sequence itself is random and errors are random with constant error rate.
We present some experiences with the problem of multiple genome comparison, analogous to multiple sequence alignment in sequence comparison, under the inversion and transposition distance metrics, given a fixed phylogeny. We first describe a heuristic for the case in which the phylogeny is a star on three vertices and then use this to approximate the multiple genome comparison problem via local search.
The MSA program, written and distributed in 1989, is one of the few existing programs that attempts to find optimal alignments of multiple protein or DNA sequences. The MSA program implements a branch-and-bound technique together with a variant of Dijkstra's shortest paths algorithm to prune the basic dynamic programming graph. We have made substantial improvements in the time and space usage of MSA. The improvements make feasible a variety of problem instances that were not feasible previously. On some runs we achieve an order of magnitude reduction in space usage and a significant multiplicative factor speedup in running time. To explain how these improvements work, we give a much more detailed description of MSA than has been previously available. In practice, MSA rarely produces a provably optimal alignment and we explain why.
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed non-repetitive sequences of length 50,000 sampled at error rates of as high as 10 percent.
Motivated by the problem in computational biology of reconstructing the series of chromosome inversions by which one organism evolved from another, we consider the problem of computing the shortest series of reversals that transform one permutation to another. The permutations describe the order of genes on corresponding chromosomes, and a reversal takes an arbitrary substring of elements and reverses their order.
For this problem we develop two algorithms: a greedy approximation algorithm that finds a solution provably close to optimal in O(n^2) time and O(n) space for an n element permutation, and a branch and bound exact algorithm that finds an optimal solution in O(m L(n,n)) time and O(n^2) space, where m is the size of the branch and bound search tree and L(n,n) is the time to solve a linear program of n variables and n constraints. The greedy algorithm is the first to come within a constant factor of the optimum; it guarantees a solution that uses no more than twice the minimum number of reversals. The lower and upper bounds of the branch and bound algorithm are a novel application of maximum weight matchings, shortest paths, and linear programming.
In a series of experiments we study the performance of an implementation on random permutations and permutations generated by random reversals. For permutations differing by k random reversals we find that the average upper bound on reversal distance estimates k to within one reversal for k less than (1/2)n and n at most 100. For random permutations we find that the average difference between the upper and lower bounds is less than 3 reversals for n at most 50. Due to the tightness of these bounds we can solve to optimality random permutations on 30 elements in a few minutes of computer time.
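To make the objects concrete, here is a bare version of breakpoint-greedy reversal sorting: repeatedly apply a reversal that removes the most breakpoints, flipping an increasing strip when no reversal removes any. The paper's 2-approximation adds tie-breaking (favoring reversals that leave a decreasing strip) that this sketch omits, so only the flavor, not the guarantee, carries over.

```python
def breakpoints(perm):
    """Count adjacencies with |p[i] - p[i+1]| != 1, using sentinels
    0 and n+1 at the ends."""
    p = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for i in range(len(p) - 1) if abs(p[i] - p[i + 1]) != 1)

def greedy_reversal_sort(perm):
    """Return a list of reversals (i, j), each reversing perm[i:j],
    that sorts a permutation of 1..n. Bare greedy sketch."""
    perm, moves = list(perm), []
    while breakpoints(perm) > 0:
        base, best = breakpoints(perm), None
        for i in range(len(perm)):
            for j in range(i + 2, len(perm) + 1):
                cand = perm[:i] + perm[i:j][::-1] + perm[j:]
                b = breakpoints(cand)
                if b < base and (best is None or b < best[0]):
                    best = (b, cand, (i, j))
        if best is None:
            # no breakpoint-removing reversal exists: flip the leading
            # increasing strip to create a decreasing one
            j = 1
            while j < len(perm) and perm[j] == perm[j - 1] + 1:
                j += 1
            perm, move = perm[:j][::-1] + perm[j:], (0, j)
        else:
            _, perm, move = best
        moves.append(move)
    return moves
```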
A new and largely unexplored area of computational biology is combinatorial algorithms for genome rearrangement. In the course of its evolution, the genome of an organism mutates by processes that can rearrange whole segments of a chromosome in a single event. These rearrangement mechanisms include inversion, transposition, duplication, and translocation, and a basic problem is to determine the minimum number of such events that transform one genome to another. This number is called the rearrangement distance between the two genomes, and gives a lower bound on the number of events that must have occurred since their divergence, assuming evolution proceeds according to the processes under study.
In this paper, we begin the algorithmic study of genome rearrangement by translocation. A translocation exchanges material at the ends of two chromosomes within a genome. We model this as a process that exchanges prefixes and suffixes of strings, where each string represents a sequence of distinct markers along a chromosome in the genome. For the general problem of determining the translocation distance between two such sets of strings, we present a 2-approximation algorithm. For a theoretical model in which the exchanged substrings are of equal length, we derive an optimal algorithm for translocation distance.
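As a sketch of the string model only (the prefix-prefix form and the choice of cut points here are our own illustration; the paper also models prefix-suffix exchanges), a translocation can be written as an exchange of chromosome ends:

```python
# A translocation in the string model: chromosomes are sequences of
# distinct markers, and a translocation exchanges their ends.
# (Illustrative sketch; prefix-suffix exchanges are analogous.)

def translocate(A, B, i, j):
    """Exchange the prefix A[:i] of chromosome A with the prefix
    B[:j] of chromosome B."""
    return B[:j] + A[i:], A[:i] + B[j:]

A, B = ['a', 'b', 'c', 'd'], ['x', 'y', 'z']
print(translocate(A, B, 2, 1))  # (['x','c','d'], ['a','b','y','z'])
```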
We also examine for the first time two types of rearrangements in concert. An inversion reverses the order of markers within a substring, and flips the orientation of the markers. For genomes that have evolved by translocation and inversion, we show there is a simple 2-approximation algorithm for data in which the orientation of markers is unknown, and a 3/2-approximation algorithm when orientation is known.
These results take a step towards extending the area from the analysis of simple organisms, whose genomes consist of a single chromosome, and whose evolution has largely involved a single type of rearrangement event, to the analysis of organisms such as man and mouse, whose genomes contain many chromosomes, and whose history since divergence has largely consisted of inversion and translocation events.
We study the problem of comparing two circular chromosomes that have evolved by chromosome inversion, assuming that the order of corresponding genes is known, as well as their orientation. Determining the minimum number of inversions is equivalent to finding the minimum number of reversals to sort a signed circular permutation, where a reversal takes an arbitrary substring of elements and reverses their order, as well as flipping their sign. We show that tight bounds on the minimum number of reversals can be found by simple and efficient algorithms.
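The reversal operation on signed permutations is easy to state in code; a minimal sketch of the operation alone (the bounding algorithms themselves are not shown):

```python
# A signed reversal on a linear representation of the permutation:
# reverse the chosen substring and flip each element's sign.

def signed_reversal(p, i, j):
    """Reverse p[i..j] inclusive and negate the reversed elements."""
    return p[:i] + [-x for x in reversed(p[i:j + 1])] + p[j + 1:]

# One reversal sorts this example: reversing positions 1..2 turns
# (-3, +2) into (-2, +3).
print(signed_reversal([+1, -3, +2, +4], 1, 2))  # [1, -2, 3, 4]
```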
We define a new problem in multiple sequence alignment, called maximum weight trace. The problem formalizes in a natural way the common practice of merging pairwise alignments to form multiple sequence alignments, and contains a version of the minimum sum of pairs alignment problem as a special case.
Informally, the input is a set of pairs of matched characters from the sequences; each pair has an associated weight. The output is a subset of the pairs of maximum total weight that satisfies the following property: there is a multiple alignment that places each pair of characters selected by the subset together in the same column. A set of pairs with this property is called a trace. Intuitively a trace of maximum weight specifies a multiple alignment that agrees as much as possible with the character matches of the input.
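The correspondence between traces and alignments can be made concrete: the pairs that a given multiple alignment places in the same column form a trace, and summing the input weights of those pairs scores how well the alignment agrees with the input. A minimal sketch, with input conventions of our own choosing:

```python
# Score a multiple alignment against weighted input pairs: the pairs
# realized in the same column form the alignment's trace.

def trace_weight(rows, weight):
    """rows: equal-length aligned sequences with '-' for gaps;
    weight: dict mapping ((seq_i, pos_i), (seq_j, pos_j)) to the
    input weight of that matched pair of characters."""
    pos = [0] * len(rows)          # next character index per sequence
    total = 0
    for col in range(len(rows[0])):
        filled = [i for i, r in enumerate(rows) if r[col] != '-']
        for a in range(len(filled)):
            for b in range(a + 1, len(filled)):
                i, j = filled[a], filled[b]
                total += weight.get(((i, pos[i]), (j, pos[j])), 0)
        for i in filled:
            pos[i] += 1
    return total

# Two weighted pairs, both realized in the same columns here:
w = {((0, 0), (1, 0)): 2, ((0, 2), (1, 1)): 1}
print(trace_weight(["ACG", "A-G"], w))  # 3
```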
We develop a branch and bound algorithm for maximum weight trace. Though the problem is NP-complete, an implementation of the algorithm shows we can solve instances on as many as 6 sequences of length 250 in a few minutes. These are among the largest instances that have been solved to optimality to date for any formulation of multiple sequence alignment.
The last few years have seen a surge of interest in computational molecular biology, with massive efforts such as the Human Genome Project underway. Sequence Analysis Primer, edited by Michael Gribskov and John Devereux, is a welcome introduction to computer analysis of biological sequence data. Drs. Gribskov and Devereux are well-known for the Genetics Computer Group sequence analysis package; their volume is not a manual on the GCG package, though it naturally includes many examples from their system. Their stated goal is to explain what types of analysis are available, what their limitations are, and what can or cannot be concluded from the analysis. In this the book succeeds admirably. Contributors have written in a friendly style, and offer plenty of advice to the reader. Perhaps the most appealing aspect of the book is its emphasis on examples drawn from actual use of programs. The final chapter, for instance, is an extended example applying nearly all the preceding material to analysis of a particular sequence, the Drosophila Notch gene.
The book in large measure is directed towards biologists. Readers of this review are likely to be applied mathematicians, statisticians, and computer scientists, hence we try to indicate some of the research problems, and point to some of the more recent work.
...
The DNA sequence in every human being is a text of three billion characters from a four letter alphabet; determining this sequence is a major project in molecular biology. The fundamental task biologists face is to reconstruct a long sequence given short fragments from unknown locations. These fragments contain errors, and may represent the sequence on one strand of the double-helix, or the reverse complement sequence on the other strand. The Sequence Reconstruction Problem is: given a collection F of fragment sequences and an error rate 0 <= e < 1, find a shortest sequence S such that every fragment F in the collection, or its reverse complement, matches a substring of S with at most e|F| errors.
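The feasibility condition in this definition is directly checkable; a minimal sketch, assuming unit-cost edit distance as the error model:

```python
# Feasibility check for one fragment under the problem definition:
# does F, or its reverse complement, match some substring of S with
# at most e*|F| errors?  (Unit-cost edits assumed as the error model.)

def reverse_complement(s):
    comp = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(comp[c] for c in reversed(s))

def best_substring_match(f, s):
    """Minimum edit distance between f and any substring of s
    (semi-global alignment: free start and end positions in s)."""
    prev = [0] * (len(s) + 1)      # row 0: the match may start anywhere
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cur[j] = min(prev[j - 1] + (f[i - 1] != s[j - 1]),
                         prev[j] + 1,
                         cur[j - 1] + 1)
        prev = cur
    return min(prev)               # the match may end anywhere in s

def fragment_fits(f, s, e):
    return min(best_substring_match(f, s),
               best_substring_match(reverse_complement(f), s)) <= e * len(f)

print(fragment_fits("ACGT", "TTACGATT", 0.25))  # one error allowed: True
```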
Sequence Reconstruction is NP-complete. We decompose the problem into (1) constructing a graph of approximate overlaps between pairs of fragments, (2) selecting a set of overlaps of maximum total weight that induce a consistent layout of the fragments, and (3) merging the overlaps into a multiple sequence alignment and voting on a consensus. A solution to (1) through (3) yields a reconstructed sequence feasible at error rate 2e/(1-e) and at most a factor 1/(1-e) longer than the shortest reconstruction, given some assumptions on fragment error. We define a measure of the overlap in a reconstruction, show that maximizing the overlap minimizes the length, and that approximating (2) within a factor of α approximates Sequence Reconstruction within a factor of (1-e)α under the overlap measure. We construct the overlap graph for (1) in O(eN^2) time given fragments of total length N at error rate e. We develop two exact and two approximation algorithms for (2). Our best exact algorithm computes an optimal layout for a graph of E overlaps and V fragments in O(K(E + V log V)) time, where K <= 2^E is the size of the branch-and-bound search tree. Our best approximation algorithm computes a layout with overlap at least 1/2 the maximum in O(V(E + V log V) log V) time. We construct the multiple sequence alignment and consensus sequence for (3) in O(H^2 L + M + N) time given a layout with at most H mutually overlapping fragments, where L is the length of the alignment and M is the number of pairs of aligned characters in the overlaps.
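As a simplified stand-in for the voting step of phase (3) (not the paper's procedure), one can take a column-wise plurality vote over alignment rows:

```python
# Column-wise voting on a consensus: given fragments already padded
# into alignment rows, take the plurality character per column and
# drop columns where the gap wins.  (Our simplified stand-in.)

from collections import Counter

def consensus(rows):
    """rows: equal-length aligned fragment rows, '-' for gaps within
    a fragment and ' ' where a fragment does not cover the column."""
    out = []
    for col in zip(*rows):
        votes = Counter(c for c in col if c != ' ')
        if votes:
            winner, _ = votes.most_common(1)[0]
            if winner != '-':
                out.append(winner)
    return ''.join(out)

rows = ["ACGTAC--  ",
        "  GTACGGTT",
        " CGTACG-T "]
print(consensus(rows))  # ACGTACGTT
```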
We evaluate an implementation by comparing the computed reconstruction to the sampled sequence. For a random sequence of length 50,000 sampled at 10% error with 500 fragments of length 500, the software reconstructed the correct layout. For a human DNA sequence of length 50,000 containing 18 repeated elements of length 300 to 2,000 sampled as above at 5% error, the software found a shorter layout, though over 95% of the layout was correct. When covered by three or more fragments, the reconstructed sequence had less than 1 error in 5,000 characters, given input with 1 error in 10.
This is the first treatment of Sequence Reconstruction with inexact data and unknown complementarity.
Multiple sequence alignment can be a useful technique for studying molecular evolution and analyzing sequence-structure relationships. Until recently, it has been impractical to apply dynamic programming, the most widely accepted method for producing pairwise alignments, to comparisons of more than three sequences. We describe the design and application of a tool for multiple alignment of amino acid sequences that implements a new algorithm that greatly reduces the computational demands of dynamic programming. This tool is able to align, in reasonable time, as many as eight sequences of average protein length.
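For reference, the pairwise dynamic program that multiple alignment generalizes to higher dimensions, shown here with unit mismatch and gap costs (the tool's actual cost model and its reduction of the search space are not shown):

```python
# Minimum-cost global pairwise alignment by dynamic programming,
# with unit mismatch and gap costs; multiple alignment fills the
# k-dimensional analogue of this table.  (Reference sketch only.)

def pairwise_alignment_cost(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                    # all-gap prefix of b
    for j in range(1, n + 1):
        d[0][j] = j                    # all-gap prefix of a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          d[i - 1][j] + 1,   # gap in b
                          d[i][j - 1] + 1)   # gap in a
    return d[m][n]

print(pairwise_alignment_cost("HEAGAWGHEE", "PAWHEAE"))
```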