8. Constraints:

Making FAKtory Support Many Sequencing Protocols

One of the unique features of the FAKtory is its ability to automatically generate and utilize a collection of constraints that model additional information about the particular protocol being employed to sequence a target DNA strand. For example, many labs sequence both ends of their clone inserts in which case one knows that the pair of fragment sequences are on opposite strands and should be within a certain distance from each other in any proposed assembly. As another example, in transposon-mapped sequencing, the two fragments primed from a given transposon have a 4 or 5 base-pair overlap and are on opposite strands (we say they have the opposite orientation). Moreover, fragments from adjacent transposons are expected to overlap and again be in the opposite orientation. The FAKII assembly kernel, upon which the FAKtory is based, is the only software suite to provide a mechanism for describing protocol-induced constraints on the range of possible solutions, and to employ it, a priori, in computing its answers. All other systems either ignore such information or tell the users about violations of such constraints, a posteriori.

The basis of the constraint framework is that the additional protocol-dependent information can be expressed as a set of overlap, orientation, and distance constraints between pairs of fragment sequences. While we can't prove that every conceivable protocol can be expressed in such a framework, it is the case that all the protocols commonly in use today are expressible in this framework. The FAKtory has a Constraint Definition panel in which a user can specify and give a symbolic name to a constraint relationship between a pair of fragments. For example, one might define Dual_ends to be the constraint that a pair of fragments be (1) in the opposite orientation, and (2) that their 5' ends be a distance, say 800 to 5000 base pairs, apart. As another example, one might define Nested_deletion to be the constraint that the 3' end of the first fragment overlap the 5' end of the second. In general, a constraint may be any set of overlap, orientation and distance constraints that a user might wish to simultaneously be true about the relationship between two fragment sequences in an assembly. Overlap constraints may be oriented in the sense that the first must be 5' of the second. Moreover, the type of an overlap relationship can be controlled in a detailed way, and distance relationships can be with respect to any anchor position relative to the fragments. Users must also assign a color to every constraint for later display in assemblies.

Now that one can define different types of constraint relationships between fragments, the problem arises as to how one designates that a particular pair of fragments has such a constraint between them. The FAKtory's solution is embodied in the idea of matchers that automatically associate a constraint between two fragments on the basis of their labels. For example, it is common practice to label sequences obtained from opposing ends of an insert something like, Aname.f and Aname.r, and if one later produces a long read of the .f-end of the insert with special chemistry or a different machine, to label that fragment something like Aname.fi. As another example, when doing nested deletion sequencing, one typically labels successive reads as, say Dname.100, Dname.101, Dname.102, and so on. Clearly, with such labeling schemes in place, one should be able to build a simple pattern matching mechanism that identifies appropriately labeled pairs of fragments and associates a given constraint relationship between them.

Such mechanisms can be set up on the Constraint Matcher panel in the following way. One can bind any capital letter, called a variable, to a regular expression. One may then specify a binding template that is a regular expression where some elements may be variables between square brackets, e.g. [A]. Finally, one specifies a mating template that is a string some of whose elements may be the special forms [X], [X+i], or [X+i!]. The variable X must have been used in the binding template, i is an integer constant, and the later two forms are permissible only if X's regular expression matches sequences of digits. A matcher consists of a template pair and a constraint relationship defined previously on the Constraint Definition panel.

When it is time to infer which constraints apply to which pairs of fragments (see Section 9), the mechanism works as follows. The binding template of a matcher is applied to a fragment's label to see if it matches. If it does, then the exact substring of the label matched by each variable in the binding template is determined. A mate label is then generated by replacing every variable reference in the mating template with the corresponding substring from the binding match. In the case of the [X+i] and [X+i!] forms, the integer i is added to the value of the number denoted by the substring, and in the later case, leading 0's are preserved. Once the mate label has been determined, FAKtory searches for a fragment whose label coincides and if found, it then designates this fragment and the one whose label matched the binding template and associates the matcher's constraint to them.

For example, suppose A is bound to [a-zA-Z]* (matches any sequence of letters) and N is bound to [0-9]* (matches any sequence of digits). First, consider the template pair: [A].f and [A].r. The binding template matches Aname.f and the corresponding mate label is Aname.r. Next, consider the template pair: [A].[N] and [A].[N+1]. The binding template matches Dname.100 and the corresponding mate label is Dname.101. Also, the binding template matches Dname.001 and produces mate Dname.2. If one wishes to keep the leading 0's in the mate label then one should use [A].[N+1!] as the mating template, in which case the mate would be Dname.002.

While the FAKtory's matching mechanism is not a universal panacea for the problem of automatically inferring constraint relationships, it is powerful enough to capture a wide range of existing data sets, and certainly more than adequate for future projects where labeling conventions can be chosen to permit its application.