
3.4 Demultiplexing

So far, we have not discussed the issue of how the appropriate path is found for a given message. In many cases, this is trivial. For example, a path is often used like a file descriptor, a window handle, or a socket descriptor. In these cases, paths are created specifically for communicating a certain kind of data and the appropriate path is known from the context in which it is being used. In other cases, however, there is no such context and the appropriate path is implicitly determined by the data itself. The most notable area where this is the case is for the networking subsystem. Networking protocols typically require hierarchical demultiplexing for arriving network packets. For example, when a network packet arrives at an Ethernet adapter [65], the Ethernet type field needs to be looked up to determine whether the packet happens to be, e.g., an IP packet or an ARP packet. If it is an IP packet, it is necessary to look up the protocol field to determine whether it is a UDP or TCP packet, and so on. This hierarchical demultiplexing is suboptimal since as more knowledge becomes available with each demultiplexing step, another path might become more suitable to process that packet.

3.4.1 Scout Packet Classifier

To alleviate the hierarchical demultiplexing problem, Scout uses a packet classifier that factors all demultiplexing operations to the earliest possible point. This lets Scout pick a path and start processing a packet only after as many demultiplexing decisions as possible have been made. Of course, a solution that would be better still (as far as Scout is concerned) would be to modify the header of the lowest layer protocol to include a field that can store the path id [56]. That way, demultiplexing could be reduced to a simple table lookup. However, since Scout needs to be able to interoperate with existing networking infrastructure, this is not always an option.

The factoring of demultiplexing decisions is best explained with an example. Suppose a UDP packet were received by an Ethernet adapter. The processing that would occur with hierarchical demultiplexing is shown on the left, with a classifier on the right:

Without Classifier:                       With Classifier:

Ethernet processing                       Ethernet demux => it's an IP packet
Ethernet demux => it's an IP packet       IP demux => it's a UDP packet
IP processing                             Ethernet processing
IP demux => it's a UDP packet             IP processing
UDP processing                            UDP processing
:                                         :
As the left half of the table shows, without a classifier, processing would have to occur in a layer-by-layer fashion, since each layer is decoupled by a demultiplexing step. In contrast, the right half of the table shows that with a classifier, all demultiplexing operations are performed first, which makes it possible to execute all protocol processing steps inside a single, long path. Note that the total amount of work is the same: there are two demultiplexing steps and three processing steps in both cases. What the table does not show is that the demultiplexing steps are usually much shorter than the processing steps, so the actual time spent in the classifier is typically much smaller than the time spent in the processing steps.

3.4.2 The Role of Classifiers

The way packet classifiers are used in Scout is unique and therefore worth discussing in some more detail. In particular, their role with respect to dynamic routing decisions is interesting. Consider Figure 15. It would be preferable to process every incoming packet destined for UDP using path p1. Unfortunately, this ideal case cannot always be achieved. For example, the UDP packets may have been fragmented by IP, or IP might have employed a non-trivial data transformation, such as encryption or compression. In all three cases, some protocol headers may not be available to the classifier. Scout copes with this problem by employing short paths as necessary. In the example, an IP fragment would have to be processed first by path p2 and then by path p4, even though this is less efficient than processing the entire packet with p1.

Figure 15: Paths Versus Classifiers

The specifics of how the partial classification problem is solved in Scout are explained next. As previously suggested, the Ethernet module (ETH) employs a classifier to decide whether a packet should be processed using path p1, p2, or p3. Suppose an IP fragment that does not contain the higher-level headers arrives at the Ethernet adapter. Without higher-level headers, the best the classifier may be able to do is to determine that path p2 is appropriate for processing the fragment.7 Once IP receives the fragment through p2, it will buffer the fragment until the entire datagram has been reassembled. At that point, IP needs to make a dynamic routing decision to find out where to send the complete datagram next. IP can do this by running its own classifier on the complete datagram, which, with a UDP packet, will tell IP that p4 should be used.

This example shows that, in Scout, a classifier is not something specific to network drivers, but instead is a mechanism that can be employed by any module that needs to make a dynamic routing decision based on the contents of the data being communicated. Typically, each network device driver uses a classifier which is invoked when a packet receive interrupt is processed, but other modules may use their own classifiers if necessary.

3.4.3 Realizing the Scout Classifier

Since Scout is a modular system, we would like to be able to express classification in a modular fashion as well. That way, the complete classifier can be built automatically from partial classification algorithms that are specified by the modules that appear along a path. Note that a modular specification of the classifier does not necessarily imply a modular implementation, though this is true for the current Scout implementation.

As discussed earlier, classifiers are used in Scout by any module that may need to make a dynamic routing decision based on the contents of a message. The task of a classifier can therefore be described as: given a module m and a message, determine a path that is appropriate for processing the given message. Since it is module m that is making the routing decision, the path must start at that module.

The classification task can be implemented in an iterative fashion using partial classifiers of the following form: given a set of paths and a message, determine the subset of paths that qualify for the processing of the given message, the module that needs to make the next (refining) classification (if any), and the message for that next module. The message for the next module is normally the same as the original message, except that the protocol header at the front of the message will have been stripped off. This scheme works as long as the classification problem is locally hierarchical. The ramifications of this constraint will be explained in detail in Section 3.4.3.1. For now, we assume that the Scout classification problem satisfies this constraint. Using such partial classifiers, the problem can be solved with the pseudo-code shown in Figure 16.

 1 classify (Module m, Message d):
 2   pset <- ∅
 3   for (p in all paths)
 4     if (p.end[0].module == m || p.end[1].module == m)
 5       pset <- pset ∪ {p}
 6   path <- NULL
 7   while (m != NULL ∧ pset != ∅) do
 8     <m, pset, d> <- m.demux (pset, d)
 9     for (p in pset)
10       if (p.end[0].module == m || p.end[1].module == m)
11         path <- p
12   return path

Figure 16: Classification Pseudo-Code

Lines 2--5 of the pseudo-code determine the set of paths that either begin or end at module m. This path set is stored in variable pset. Then, as long as there is a module m and a set of candidate paths pset, the partial classifier demux of module m is invoked (line 8), passing the current candidate set and message d as parameters. The partial classifier is expected to return the next module m that should refine the classification, a new candidate set pset, and a new message d. If there is no next module, then m will be NULL. Given the new candidate set, the code checks in lines 9--11 whether there is any path p in pset that ends at module m. If there is such a path, it is suitable for processing the original message and is stored in variable path. Scout does not allow ambiguity in classification, meaning that there is at most one path per iteration that is suitable for processing the message.8 However, finding a suitable path does not stop the search since there may be a longer path that would be preferable over a shorter one. Thus, the search continues as long as there is a next module and the candidate set is not empty. Once the search stops, line 12 returns the best path found or NULL, if no suitable path exists.

We observe that the classification process can be viewed as a tree traversal where the partial classifiers are used to select the next node to be visited and where each node corresponds to a path set. This tree-traversal can be implemented efficiently. Note that the search tree contains one node per module and per path set. In Scout, demultiplexing nodes have the following structure:

typedef struct {
  long   numPaths;
  Path   path;
  Module nextModule;
} * DemuxNode;
Integer numPaths gives the size of the set of paths represented by this node. The pointer path of a demux node n is NULL if no path exists that could handle a message that reaches node n during classification. If it is non-NULL, it points to the longest path that can be used for a message that reaches node n during classification. Pointer nextModule either refers to the next module that should be used to refine the classification or is NULL if there is no such module. The latter case implies that there is no path that extends beyond the module represented by the demux node and that the path referred to by path should be used to process the message.

Given the definition of a demux node, the partial classifiers are functions with a prototype shown below (Msg is the Scout type to represent a message, i.e., a sequence of data bytes):

typedef Path (*DemuxFunc) (DemuxNode pn, Msg m);
Note that the return value is a path. This is because each partial classifier will invoke the partial classifier of the next module, if necessary. Thus, the tree traversal loop is implicitly implemented as a sequence of nested calls. Also note that the partial classifier function of each module is stored in the module object (see Section 3.2.3.2). When a new module object is created, the module must initialize member demux in the object to the module's partial classifier function.

The next question is how the tree of demux nodes is created and maintained. Suppose a new path p has been created. The first stage s in this path is p->end[0]. If we assume that the module at which the path starts uses a hash table h to implement the partial classifier function and that the module's partial key for path p is k, then the first module in the path can update its demux node by calling function updateDemuxNode passing s, h, k and NULL as arguments in this order. The updateDemuxNode function is illustrated with the pseudo-code shown in Figure 17.

 1 void
 2 updateDemuxNode (Stage s, HashTable h, Key k, DemuxNode pn) {
 3   dk = concat (pn, k);
 4   n = hashLookup (h, dk);
 5   if (!n) {
 6     n = new (DemuxNode);
 7     n->numPaths = 0;
 8     n->path = NULL;
 9     n->nextModule = next module in path;
10     hashEnter (h, dk, n);
11   }
12   if (s == s->path->end[1])
13     n->path = s->path;
14   else if (pn)
15     n->path = pn->path;
16   ++n->numPaths;
17 }

Figure 17: Update of Demux Tree

In line 3, the function concatenates the bits of the previous demux node with those for the partial key k. Since this is the first stage in the path, there is no previous demux node (pn is NULL), so dk equals k. In line 4, the demux node is looked up using the demux key. If no demux node exists for that key yet, lines 6--10 create a new node and enter it in the hash table. In lines 12--15, the path member of the node is updated if necessary. Specifically, if stage s is the last stage of the path, then path p can be used for any message that gets classified at least to node n. If the end of the path has not been reached yet, n->path is simply set to the path of the previous node, or left unmodified if there is no previous node. In line 16, the number of paths in the path set represented by n is incremented.

A demux node is normally updated during the establish callback of a stage. Note that the same routine can be used to update the demux nodes for all stages in a newly created path. The only difference compared to the first stage is that pn, the previous-node pointer, is no longer NULL. This pointer is passed as attribute PREVIOUS_DEMUX_NODE in the attribute set passed to each stage. Note that, in general, classification can be initiated at either end of the path. Thus, when descending in the establish callback chain, this attribute points to the previous demux node with respect to classification in the forward direction, but once the end of the path is reached, the attribute changes its meaning into representing the previous node with respect to classification in the backward direction.

Note that the updateDemuxNode function works properly even in the case of a module that does not perform demultiplexing in the given direction. In such a case, the key k simply has zero length. While the function works properly, this special case can be optimized by simply updating the nextModule pointer of the previous node to the next module along the path. This has the effect that modules that do not perform classification in the given direction are skipped.

With this setup, a partial classification function is typically implemented similar to that shown in Figure 18. First, line 2 extracts the partial demux key from the message (typically from a header) and stores it in k. Then the bits of the previous demux node and the partial key are concatenated and looked up in the hash table. If an appropriate demux node exists for the message, the classification is refined by calling the partial classifier function of the next module. If that classification succeeds, the path found is returned. Otherwise, the path associated with demux node n is returned (which may be NULL if no suitable path exists).

 1 Path demux (DemuxNode pn, Msg m) {
 2   extract key k from m;
 3   dk = concat (pn, k);
 4   n = hashLookup (h, dk);
 5   if (n) {
 6     if (n->nextModule) {
 7       path = n->nextModule->demux (n, m);
 8       if (path)
 9         return path;
10     }
11     return n->path;
12   }
13   return NULL;
14 }

Figure 18: Typical Partial Classifier

3.4.3.1 Globally Hierarchical Classification

As alluded to before, the classification scheme as described so far works for locally hierarchical classification problems. By this we mean that each partial classifier can make its decision in isolation. Unfortunately, for real networking protocols, the problem is not limited to this special case. It is hierarchical, but only at the global level.

Figure 19: Example Requiring Global Hierarchical Classification

This is illustrated in Figure 19, which shows the IP and UDP header fields that are used for classification purposes. IP uses the protocol field, and the local and remote addresses. In the figure, these fields are labelled protl, laddr, and raddr respectively. For UDP, the fields used are the local and remote port number; lport and rport in the figure. For paths that represent actively opened network connections (active paths), all five fields need to be matched. On the other hand, for paths representing passively opened connections (passive paths), only the protocol field and local port field need to be matched. This means that the classification process is still hierarchical, but only at the global level.

 protl  laddr           raddr         lport  rport        maps to:
<UDP,   any,            any,          21,    any>    => path 1 (passive)
<UDP,   192.12.69.168,  18.181.0.21,  21,    3949>   => path 2 (active)

Table 3: Example Mappings

Suppose a system has the mappings shown in Table 3 in place. If a packet with the fields <protl=UDP, laddr=192.12.69.168, raddr=18.181.0.21, lport=21, rport=3950> were to arrive, then IP could not make a proper decision as to whether the path subset it passes to UDP should include both the active path and the passive path or only the former. If it were to pass on the active path only, classification would erroneously fail, since there is no active path matching the packet. On the other hand, if it were to pass on both paths, then classification would be erroneously ambiguous for packets that belong to the active path. That is, whether IP should use the fields for an active path or a passive path depends on the classification decision of UDP. Since IP cannot know how UDP's partial classifier works, it may make the wrong choice no matter how careful it is.

To solve this problem in general, it is necessary to try to find a path with the most specific keys first and, if that fails, backtrack and try the same with a less specific key. For example, networking protocols such as IP that distinguish between active and passive keys, could use logic similar to the one shown in Figure 20.

1 path <- try_active_demux (pn, m);
2 if (!path)
3   path <- try_passive_demux (pn, m);

Figure 20: Generalized Partial Classifier

In the figure, functions try_active_demux and try_passive_demux correspond to classifiers as shown in Figure 18 that use the active and passive keys, respectively. With this kind of backtracking in place, it is guaranteed that the correct combination of keys is eventually found and therefore classification is guaranteed to find the correct path, if it exists.

Unfortunately, adding backtracking causes classification complexity to increase from O(N) to O(2^N), where N is the number of partial demux functions that require backtracking. Even though this involves exponential complexity, it might be quite practical since the number of demultiplexing layers is rarely greater than four or five (not all of which may require backtracking). Better still, for networking protocols backtracking can be limited in a way that brings classification back to linear time complexity. Observe that for the vast majority of systems, it is acceptable to limit paths to use either all active or all passive keys.9 In this limited scenario, classification can be achieved by first trying to find an active path. If an active key lookup anywhere along the chain of partial classifiers fails, then classification can be aborted and retried with passive keys only. Thus, in the worst case, 2N classification decisions have to be made, where N is the number of modules that are traversed by a path. Note that this scheme does not force limited backtracking in every module. If there are any modules that require richer backtracking (e.g., because there are more than two possible choices), then they can still do so. In other words, this scheme reduces classification overhead to linear time complexity for the common case yet is not limited to locally hierarchical classification.

3.4.3.2 Classification in Non-Demultiplexing Modules

Most demultiplexing occurs in modules implementing networking protocols, but it certainly is desirable to be able to connect modules implementing networking protocols and other modules without any bad consequences (as long as the module graph is sensible). For example, a filter that converts MPEG-encoded video into a sequence of images should work properly independent of whether its input is fed from a disk or the network. Similarly, it should be possible to insert a filter (e.g., one that counts the number of bytes that pass through it) between a pair of networking protocols (e.g., TCP and IP) and have the system operate as expected.

This kind of support requires answering the question of how demultiplexing should be interpreted with respect to modules that do not perform any demultiplexing themselves. Fortunately, this is straightforward in Scout: a module simply needs to ensure that its partial classifier applies the same data transformations as the actual processing. If the data is not transformed at all (such as in the byte-counting filter mentioned previously), then the partial classifier would not have to do anything either. On the other hand, if a module modifies all data passing through it---e.g., by complementing it---then the partial classifier would have to do the same. Of course, if the operation is complex or computationally expensive (such as encryption), then the module might prefer not to let classification proceed beyond itself. A module can do this by blocking the PREVIOUS_DEMUX_NODE attribute during the establish callback. When a module does this, it implies that the earlier modules have to be able to make an accurate enough classification so that a unique path can be found even without the help of the partial classifiers in later modules.

3.4.4 Evaluation

To gain a better understanding of the Scout classifier, we compare it to other existing packet classifiers. The two classifiers with the best published performance are DPF [30] and PathFinder [5]. Unfortunately, the existing implementation of PathFinder is not 64-bit clean and even on 32-bit systems there appear to be problems that prevent it from working reliably. For this reason, the remainder of this discussion is limited to a comparison with DPF. The available DPF implementation is version 2.0 beta, which is a re-implementation of the version presented in [30]. The measurements for this comparison were performed in a user-level test environment running on Linux on an AlphaStation 600/333. This machine uses a 21164 Alpha chip running at 333MHz. All measurements were done with warm caches. Measurements not reported here indicated that the slowdown due to cold caches is significant, but comparable for the two versions and therefore of no particular interest to this discussion. To make the comparison with DPF more meaningful, the Scout classifier was implemented assuming messages are represented as a linear sequence of bytes (the Scout kernel uses a more general tree of linear byte sequences instead).

                                       Scout      DPF v2.0
Expressiveness                         universal  limited
Trust assumed                          yes        no
Static size of code and data [byte]    4,901      122,311
Lookup active path [µs]                0.75       0.56
Lookup passive path [µs]               1.45       0.65
Insert passive path [µs]               3.70       24.2
Remove passive path [µs]               1.02       3.78
Bytes per filter                       56         8,440

Table 4: Summary of Scout Classifier and DPF Comparison

Table 4 presents a summary of the two classifiers. The row labelled Expressiveness shows that the Scout classifier is universal. This is because the partial demux functions are written in ordinary C and because an exhaustive tree search can be performed if necessary. In contrast, DPF version 2 filters can make use of only two kinds of operators. The first kind supports testing a bitfield in the message contents for equality with either a constant or another bitfield in the message. The second kind supports dropping bytes from the front of the message, by either a constant or a variable number of bytes. A filter achieves the latter by specifying a bitfield in the message contents that determines the (possibly scaled) number of bytes to drop. Unfortunately, this feature did not work properly for the Alpha architecture in the beta version of DPF v2.0 that was available at the time of this writing. For this reason, the TCP/IP filters used in the measurements assumed fixed-size headers (which is not realistic, as both TCP and IP support header options).

The row labelled Trust assumed shows that the Scout classifier assumes the partial demux functions are trusted. To support untrusted demux functions, Scout would have to employ software fault-isolation [109] to sandbox the possibly malicious code. In contrast, the simple DPF filters are provably safe and hence can be supplied by untrusted users.

The third row, labelled Static size of code and data, compares the static size of the code and data sections required by the two classifiers. This was measured by counting the number of bytes by which the size of the test program increases when linking in one or the other classifier. In the DPF case, this includes the size of the vcode dynamic code generator, which accounts for about 74KB of the reported size [31].

The next four rows list various performance aspects. The row labelled Lookup active path lists the time required to classify a TCP/IP packet. To provide a realistic environment, the classification was performed with three filters installed: one for the TCP/IP packets being looked up and one each for ARP and ICMP packets [79, 81]. The table shows that DPF is about 25% faster than the Scout classifier. This may sound like a large difference, but it corresponds to 62 CPU cycles, which is marginal compared to the cost of fielding an interrupt, which can easily be several hundred cycles.

The next row, labelled Lookup passive path, shows the 2N behavior of the Scout classifier discussed earlier. The data show that classifying a packet for a passive path is almost twice as expensive as for an active path. In contrast, DPF classification overhead increases by only 16%. Since passive paths are typically used to initiate path creation, the slower Scout time is not expected to significantly affect overall performance.

It is also interesting to study classification performance as a function of the number of active TCP/IP filters in the classifier. This is illustrated in Figure 21. It shows that the Scout classification time remains largely unaffected by the number of active filters. The slightly larger time when two filters are active is a cache effect. DPF optimizes the case where just one filter is active, but uses a hash table for all other cases. Hence, classification time is slightly better with just one filter in place and remains largely constant when two or more filters are active.

In Table 4, the rows labelled Insert passive path and Remove passive path show the time it takes to insert and remove a filter for a passive TCP/IP path. Insertion is more than six times slower for DPF than for Scout because DPF has to perform fairly sophisticated analysis of the newly inserted filter and because it requires dynamic code generation. It should be noted, however, that filter insertion/removal is typically much less frequent than packet classification.

Figure 21: Classification Performance as a Function of Number of Filters

The last row, labelled Bytes per filter, shows the average size of the memory allocated dynamically per TCP/IP filter. In the Scout case, adding a TCP/IP filter requires just one demux node of 56 bytes if other TCP/IP filters are already installed. In the worst case, adding a TCP/IP filter to an empty classifier requires four demux nodes or 224 bytes. In contrast, DPF requires about 6-8KB per filter. This is because DPF uses a large, statically sized array to represent filters. Simple filters leave most of this array unused but, on the other hand, using a large array keeps the number of dynamic memory allocation requests to a minimum.

In summary, the comparison shows that the Scout classification approach is competitive in performance with other classifiers while being completely general. However, the above measurements should be considered optimistic, since the user-level test program makes some simplifying assumptions and since it assumes warm caches. For example, measurements with an actual Scout kernel indicate that, on an older 21064A Alpha clocked at 300MHz, classifying a TCP packet takes about 3.5µs when the instruction cache is warm, and about 7.2µs when the instruction cache is cold. In the warm-cache case, 0.8µs are spent on initializing and destroying the message data structure that Scout uses for the classification process, so the actual classification process takes 2.7µs. On the same machine, the test program reports a time of 1.14µs. This difference is most likely caused by three factors. First, the kernel implementation uses the normal Scout message abstraction instead of a simple linear byte sequence. Second, the kernel uses a more general and therefore more heavy-weight hash-table implementation. Third, the partial demultiplex functions are more widely scattered in the address space than in the test program. There is no reason these issues could not be addressed, but it is not clear whether doing so would result in an overall performance improvement that would make this worthwhile.

