7. Prescreeners: A Unified Framework for Clipping, Screening, and Tagging

The pipeline must always consist of an initial Input stage that imports fragment records from the file system, an Overlap stage which computes all overlaps between fragment sequences, and a final Assembly stage that melds the fragments into a reconstruction of the target. Between the Input and Overlap stages may be any number and combination of Prescreener stages. In the design of FAKtory, we chose to develop a single, unifying framework in which one could formulate a wide range of criteria and recipes for clipping, screening, and tagging fragment sequences. This framework involves a set of five types of pattern recognizers and a small expression language for flexibly combining the results of these recognizers.

We start by describing such a general-purpose Prescreener stage whose configuration panel presents the user with the full power of the framework. A Prescreener stage consists of several prescreeners, each of which can be programmed either to cut off a 5'- or 3'-end of a fragment's sequence, or to tag substrings of the sequence with a specifiable color and symbolic name. The interval(s) of a fragment's sequence that a prescreener clips or tags are specified by an interval expression, which is essentially a pattern that matches a set of disjoint substrings, given as intervals of character positions. We describe interval expressions and the intervals they match in a bottom-up fashion, starting with the simplest:

An interval expression can be a recognizer. Each recognizer matches either a single interval or a collection of disjoint intervals of positions in the fragment sequence to which it is being applied. There are five types of recognizers:
- An interval recognizer is just a fixed interval [I,J]. For example, [0,20] matches the first 20 bases of a sequence, and [-20,-0] matches the last 20 bases. One may also specify this in percentage terms, so [0,20]% matches the first 20% of the bases in a sequence.
- A regular expression recognizer is a regular expression, an error tolerance, and a designation that one wants the 3'-most, 5'-most, or all matches to the expression. The recognizer returns an interval or intervals that match the regular expression within the given number of errors. It is useful for short patterns such as restriction enzyme cut sites.
- There are several signal recognizers that match intervals of a sequence based on the measure of the signal/noise ratio, peak-height/max-height ratio, and peak width in a window of specifiable length.
- A frequency recognizer matches intervals in which the frequency of specifiable bases (including N's) is above or below a given level in a window of a given length.
- An overlap recognizer matches any intervals that overlap, within a certain match stringency, a reference sequence from a user-specified library of such sequences. The library typically contains things such as consensus Alu and Line elements, and vectors such as variants of PUC commonly used in the lab. These recognizers are useful for tagging repeats and identifying vector sequence that needs to be trimmed.
These base recognizers are configured in a Recognizers sub-panel to each Prescreener panel that is dedicated to that purpose. Each recognizer is given a name so that it can then be referred to in an interval expression for a prescreener.
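As an illustration of the two simplest recognizers, the following is a minimal sketch and not FAKtory's implementation. It assumes that intervals are stored as (start, end) pairs of boundary positions, so that (0, 20) covers the first 20 bases; the function names, the from_end flag (standing in for the negative notation [-20,-0]), and the merging of overlapping windows are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed representation: (start, end) boundary positions, covering bases start..end-1.
Interval = Tuple[int, int]


def fixed_interval(n: int, i: int, j: int,
                   from_end: bool = False, percent: bool = False) -> List[Interval]:
    """Interval recognizer [I,J] on a sequence of length n (hypothetical helper)."""
    if percent:                      # [I,J]% : scale percentages to positions
        i, j = n * i // 100, n * j // 100
    if from_end:                     # [-I,-J] : count positions back from the 3' end
        i, j = n - i, n - j
    start, end = max(0, min(i, j)), min(n, max(i, j))
    return [(start, end)] if start < end else []


def frequency_recognizer(seq: str, bases: str, window: int,
                         threshold: float) -> List[Interval]:
    """Match windows in which the fraction of the given bases exceeds threshold.

    Overlapping or abutting windows are merged so the result is disjoint.
    """
    hits: List[Interval] = []
    for s in range(0, len(seq) - window + 1):
        count = sum(1 for ch in seq[s:s + window] if ch in bases)
        if count / window > threshold:
            if hits and s <= hits[-1][1]:
                hits[-1] = (hits[-1][0], s + window)   # extend the previous hit
            else:
                hits.append((s, s + window))
    return hits
```

For example, under these assumptions fixed_interval(100, 20, 0, from_end=True) plays the role of [-20,-0] on a 100-base read, and frequency_recognizer(seq, "N", 4, 0.7) flags runs dense in N's.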
Any set-theoretic combination, X op Y, where X and Y are interval expressions, matches the appropriate combinations of the intervals matched by X and Y. The operator symbols used are | for union, & for intersection, and - for minus. Also !X matches the complement of X's intervals with respect to the fragment sequence.
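The set-theoretic operators can be sketched over sorted lists of disjoint intervals. This is only an illustration under assumed conventions: intervals are (start, end) boundary pairs, n is the fragment length, and the function names are hypothetical rather than taken from FAKtory.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed (start, end) boundary positions


def normalize(ivs: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or abut."""
    out: List[Interval] = []
    for s, e in sorted(ivs):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out


def union(x: List[Interval], y: List[Interval]) -> List[Interval]:
    """X | Y"""
    return normalize(x + y)


def intersect(x: List[Interval], y: List[Interval]) -> List[Interval]:
    """X & Y, by a two-pointer sweep over both sorted lists."""
    x, y = normalize(x), normalize(y)
    out: List[Interval] = []
    i = j = 0
    while i < len(x) and j < len(y):
        s, e = max(x[i][0], y[j][0]), min(x[i][1], y[j][1])
        if s < e:
            out.append((s, e))
        if x[i][1] < y[j][1]:
            i += 1
        else:
            j += 1
    return out


def complement(x: List[Interval], n: int) -> List[Interval]:
    """!X, taken with respect to a fragment of length n."""
    out: List[Interval] = []
    pos = 0
    for s, e in normalize(x):
        if s > pos:
            out.append((pos, s))
        pos = max(pos, e)
    if pos < n:
        out.append((pos, n))
    return out


def minus(x: List[Interval], y: List[Interval], n: int) -> List[Interval]:
    """X - Y, expressed as X & !Y."""
    return intersect(x, complement(y, n))
```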
The expression X+c, where X is an interval expression and c is an integer constant, matches the intervals matched by X all shifted in the 3' direction by c positions. The expression X-c similarly shifts X's intervals in the 5' direction.
The expression [X,Y] matches the interval that starts at the 5'-end of the 5'-most interval matched by X and ends at the 3'-end of the 3'-most interval matched by Y. If X doesn't match anything then the expression is equivalent to [Y,Y]; if Y doesn't match anything, to [X,X]; and if neither matches anything then the expression doesn't match anything. One may also specify open intervals at either end by replacing [ with (, and ] with ).
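The shift and span forms might be sketched as follows. The (start, end) boundary representation, the function names, and the choice to clamp shifted intervals to the fragment are assumptions of this example; open intervals are not modeled.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed (start, end) boundary positions


def shift(x: List[Interval], c: int, n: int) -> List[Interval]:
    """X+c (c > 0, toward the 3' end) or X-c (pass a negative c).

    Clamping to [0, n] is an assumption; intervals shifted entirely
    off the fragment are dropped.
    """
    out = [(max(0, s + c), min(n, e + c)) for s, e in x]
    return [(s, e) for s, e in out if s < e]


def span(x: List[Interval], y: List[Interval]) -> List[Interval]:
    """[X,Y]: from the 5' end of X's 5'-most interval to the 3' end of Y's 3'-most."""
    if not x and not y:
        return []          # neither side matches: no match
    if not x:
        x = y              # equivalent to [Y,Y]
    if not y:
        y = x              # equivalent to [X,X]
    start = min(s for s, _ in x)
    end = max(e for _, e in y)
    return [(start, end)] if start < end else []
```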
The expression X ? Y : Z matches Y if X matches something and Z otherwise. Similarly, X ? Y matches Y if X matches something and doesn't match anything otherwise, and X : Z matches Z if X doesn't match anything and matches X otherwise.
The expression X(Y) matches everything matched by X when evaluated over the substrings of the fragment's sequence matched by Y.
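The conditional forms and the restricted evaluation X(Y) can be illustrated as below. This sketch assumes that a recognizer is a function from a sequence to a list of (start, end) intervals and that an empty list means "matches nothing"; exact_site is a hypothetical stand-in for a regular expression recognizer with no error tolerance.

```python
import re
from typing import Callable, List, Tuple

Interval = Tuple[int, int]                      # assumed (start, end) boundary positions
Recognizer = Callable[[str], List[Interval]]    # assumed recognizer signature


def cond(x: List[Interval], y: List[Interval], z: List[Interval]) -> List[Interval]:
    """X ? Y : Z"""
    return y if x else z


def cond_then(x: List[Interval], y: List[Interval]) -> List[Interval]:
    """X ? Y"""
    return y if x else []


def cond_else(x: List[Interval], z: List[Interval]) -> List[Interval]:
    """X : Z"""
    return x if x else z


def restrict(recognizer: Recognizer, seq: str, y: List[Interval]) -> List[Interval]:
    """X(Y): run X on each substring matched by Y, then map back to fragment positions."""
    out: List[Interval] = []
    for s, e in y:
        for a, b in recognizer(seq[s:e]):
            out.append((s + a, s + b))
    return out


def exact_site(pattern: str) -> Recognizer:
    """Hypothetical recognizer: all exact, non-overlapping matches of a pattern."""
    def rec(seq: str) -> List[Interval]:
        return [(m.start(), m.end()) for m in re.finditer(pattern, seq)]
    return rec
```

With seq = "GAATTCAAAAGAATTC", the call restrict(exact_site("GAATTC"), seq, [(8, 16)]) reports only the second site, because the first lies outside the interval matched by Y.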

This simple interval expression language is sufficient to describe quite complex clipping or tagging criteria. For example, suppose one wants to clip at the clone insertion restriction site, or at the 50th base if such a site cannot be found because of poor signal quality in the initial part of the read. One can express this with the interval expression [ 0 , Site(Intv) : Intv ], where Intv is the interval recognizer [0,50], and Site is a regular expression recognizer for the cut site with, say, 1 mismatch allowed and optioned to return the 5'-most instance.
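To make the example concrete, here is a minimal sketch of how [ 0 , Site(Intv) : Intv ] might evaluate on a single read. Several things are assumptions of the sketch rather than details from FAKtory: the cut site is the literal string GAATTC, "errors" are counted as substitutions only, the site must lie wholly within the first 50 bases, and intervals are (start, end) boundary pairs.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed (start, end) boundary positions


def find_site_5prime(seq: str, site: str, max_mismatch: int) -> List[Interval]:
    """Return the 5'-most occurrence of site within max_mismatch substitutions."""
    k = len(site)
    for s in range(0, len(seq) - k + 1):
        mismatches = sum(a != b for a, b in zip(seq[s:s + k], site))
        if mismatches <= max_mismatch:
            return [(s, s + k)]
    return []


def clip_interval(seq: str, site: str = "GAATTC",
                  limit: int = 50, max_mismatch: int = 1) -> Interval:
    """Evaluate [ 0 , Site(Intv) : Intv ] with Intv = [0, limit]."""
    intv_end = min(limit, len(seq))
    intv = [(0, intv_end)]                                       # Intv
    hit = find_site_5prime(seq[:intv_end], site, max_mismatch)   # Site(Intv)
    chosen = hit if hit else intv                                # Site(Intv) : Intv
    return (0, max(e for _, e in chosen))                        # [ 0 , ... ]
```

Under these assumptions, a read with the site at positions 10 to 16 is clipped through base 16, while a read with no recognizable site in its first 50 bases is clipped through base 50.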

In designing the general Prescreener stage above, we again faced the problem that the desire for generality produces a mechanism requiring significant skill to use. Often, however, the full power and concomitant complexity of the framework is not needed. To alleviate this problem, we set about designing simpler, specialized interfaces called Clip, Screen, and Tag stages that are sub-classes of prescreeners directly suited to expressing common clipping, vector screening, and element tagging functions. We give a quick overview of each of these special panels:

Vector: This panel is restricted to building a set of clipping prescreeners each of which is a single overlap recognizer. The design and layout of the panel have been tailored for this simple subset of the prescreener capability. In essence, the user is presented with a panel where they select the reference vector sequences they wish to screen out.
Tag Panels: This panel is restricted to building a set of tagging prescreeners, each of which is a single regular expression, frequency, or overlap recognizer that is automatically optioned to report all matches. Like the other panels its design and layout are tailored to present a simple interface to this subclass.
Clip Panels: This panel is restricted to building a set of 5' clipping prescreeners and a set of 3' clipping prescreeners, each of which is a single interval, regular expression, signal, or frequency recognizer that is automatically optioned for matching the 3'- or 5'-most occurrence, respectively. The 5' clipping prescreener is guaranteed to clip from the start of the fragment sequence to the 3' end of its recognizer's match (if any). Symmetrically, the 3' clipping prescreener clips from the end of the fragment to the 5' end of its recognizer's match (if any).
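The clipping rule of the Clip panel can be sketched as a small function. The name apply_clips and the (start, end) boundary representation are assumptions for illustration; None stands for a recognizer that matched nothing.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # assumed (start, end) boundary positions


def apply_clips(seq: str,
                five_match: Optional[Interval] = None,
                three_match: Optional[Interval] = None) -> str:
    """Return the part of seq that survives 5' and 3' clipping.

    The 5' clip removes everything from the start of the fragment through
    the 3' end of its match; the 3' clip removes everything from the 5' end
    of its match through the end of the fragment.
    """
    start = five_match[1] if five_match else 0
    end = three_match[0] if three_match else len(seq)
    return seq[start:end] if start < end else ""
```

Note that a 5' match at (2, 4) still removes bases 0 and 1, since the clip always extends back to the start of the fragment.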

We find that in practice these simple sub-classes suffice to express most of the preprocessing needed on fragment sequences before computing overlaps between them and then assembling them into contigs. Only on occasion is the full power of interval expressions required. As a final note, every Clip, Tag, and Vector panel can be viewed as a general prescreener if desired. The prescreeners therein may then be modified using the more powerful console of the Prescreener panel. One may always flip back to the original subclass, provided the specification has not changed. This permits users to learn about the Prescreener by seeing how Clip, Vector, and Tag specifications are codified as Prescreener specifications.