
3 Latency Reducing Techniques


This section describes three techniques that we evaluated as a means to reduce protocol-processing latency. Unlike many other optimization techniques that improve execution speed by reducing the number of instructions executed, these techniques primarily target the cost of executing each instruction.

3.1 Outlining


As the name suggests, outlining is the opposite of inlining. It exploits the fact that not all basic blocks in a function are executed with equal frequency. For example, error handling in the form of a kernel panic is clearly expected to be a low-frequency event. Unfortunately, it is rarely possible for a compiler to detect such cases based only on compile-time information. In general, basic blocks are generated simply in the order of the corresponding source code lines. For example, the sample C source code shown on the left is often translated to machine code of the form shown on the right:

                          :
    :                   load      r0,(bad_case)
 if (bad_case) {        jump_if_0 r0,good_day
   panic("bad day");    load_addr a0,"bad day"
 }                      call      panic
 printf("good day");  good_day:
    :                   load_addr a0,"good day"
                        call      printf
                          :

The above machine code is suboptimal for two reasons: (1) it requires a jump to skip the error handling code, and (2) it introduces a gap in the i-cache if the cache block size is larger than one instruction. A taken jump often results in pipeline stalls, and i-cache gaps waste memory bandwidth because useless instructions are loaded into the cache. This can be avoided by moving error handling code out of the main line of execution, that is, by outlining error handling code. For example, error handling code could be moved to the end of the function or to the end of the program.

Outlining traditionally has been associated with profile-based optimizers [12, 26]. Profile-based optimizers are aggressive rather than conservative---any code that is not covered by the collected profile will be outlined. They also make it difficult to map the changes back to the source code level, so it is not easy to verify that a collected profile is indeed (sufficiently) exhaustive. Finally, relatively simple changes to the source code may require collecting a new profile. The main advantage of profile-based optimization is that it can be easily automated.

Due to the above drawbacks and the unique opportunities present in networking and low-level systems code, our outlining approach is language-based and conservative. Being conservative, it may miss outlining opportunities and be less effective than a profile-based approach. However, systems code is unique in that it contains much code that can be outlined trivially. For example, it is not uncommon to find functions that contain up to 50% error checking/handling code. Just outlining these obvious cases can result in dramatic code-density improvements. Since the approach is language based, the source code explicitly indicates what portions of the code get outlined. That is, outlining gives full control to the systems programmer.

We modified the GNU C compiler such that if-statements can be annotated with a static prediction as to whether the if-conditional will mostly evaluate to TRUE or FALSE. For annotated if-statements, the machine code for the unlikely branch is generated at the end of the function. Un-annotated if-statements are translated as usual. With this compiler extension, the code on the left is translated into the machine code on the right:

                           :
    :                    load      r0,(bad_case)
 if (bad_case @ 0) {     jump_if_not_0 r0,bad_day
    panic("bad day");    load_addr a0,"good day"
 }                       call      printf
 printf("good day");   continue:
    :                      :
                         return_from_function

                       bad_day:
                         load_addr a0,"bad day"
                         call      panic
                         jump      continue

Note that the if-conditional is followed by an @0 annotation. This tells the compiler that the expression is expected to evaluate to FALSE most of the time. In contrast, the annotation @1 would mark a mostly-TRUE expression. For portability and readability, these annotations are normally hidden in C pre-processor macros.
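
To illustrate, the following sketch shows one way such prediction hints might be hidden behind macros. It is not the actual code from our system: GCC's __builtin_expect is used here as a stand-in for the @0/@1 annotations, and the panic() routine is a hypothetical stub.

  #include <stdio.h>
  #include <stdlib.h>

  /* Wrappers in the spirit of the @0/@1 annotations; __builtin_expect
   * serves as a stand-in for the compiler extension described above. */
  #define PREDICT_FALSE(expr) __builtin_expect(!!(expr), 0)
  #define PREDICT_TRUE(expr)  __builtin_expect(!!(expr), 1)

  static void panic(const char *msg)        /* hypothetical error handler */
  {
      fprintf(stderr, "panic: %s\n", msg);
      abort();
  }

  int main(void)
  {
      int bad_case = 0;                     /* normally FALSE */

      if (PREDICT_FALSE(bad_case)) {        /* unlikely branch is outlined */
          panic("bad day");
      }
      printf("good day\n");
      return 0;
  }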

The above machine code avoids the taken jump and the i-cache gap at the cost of an additional jump in the infrequent case. Analogous code is generated for if-statements with an else-branch; in that case, however, the static number of jumps remains the same. It is also possible to use if-statement annotations to direct the compiler's optimizer. For example, it would be reasonable to give outlined code low priority during register allocation. Our present implementation does not yet exploit this option.

As alluded to before, outlining should not be applied overly aggressively. In practice, we found the following three cases to be good candidates for outlining:

  1. Error handling. Any kind of expensive error handling can be safely outlined. Error handling is expensive, for example, if it requires a reboot of the machine, console I/O, or similar actions.

  2. Initialization code. Any code along the critical path of execution that is executed only once (e.g., at system startup) can be outlined.

  3. Unrolled loops. The latency-sensitive case usually involves so little data processing that unrolled loops are never entered. If there is enough data for an unrolled loop to be entered, execution time is typically dominated by data-dependent costs, so that the additional overheads due to outlining are insignificant. (A sketch of this case follows the list.)
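
The sketch below illustrates the unrolled-loop case with a hypothetical byte-summing routine (not taken from the actual protocol code). The unrolled loop is guarded by a condition marked as unlikely, using the same macro idea as above, so that it can be outlined; short, latency-critical packets take only the residual loop.

  #include <stddef.h>
  #include <stdint.h>

  #define PREDICT_FALSE(expr) __builtin_expect(!!(expr), 0)

  /* Hypothetical byte-summing routine: the unrolled loop is marked as
   * unlikely and can therefore be outlined; latency-critical packets are
   * usually too short to enter it. */
  uint32_t byte_sum(const uint8_t *buf, size_t len)
  {
      uint32_t sum = 0;

      if (PREDICT_FALSE(len >= 8)) {        /* unrolled path, outlined */
          do {
              sum += buf[0] + buf[1] + buf[2] + buf[3]
                   + buf[4] + buf[5] + buf[6] + buf[7];
              buf += 8;
              len -= 8;
          } while (len >= 8);
      }
      while (len-- > 0)                     /* common short-packet path */
          sum += *buf++;
      return sum;
  }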

We found that outlining alone does not make a huge difference in end-to-end latency. However, the code density improvements that it achieves are essential to the effectiveness of the next two techniques: cloning and path-inlining.

3.2 Cloning


Cloning involves creating a copy of a function. The cloned copy can be relocated to a more appropriate address and/or optimized for a particular use. For example, if the TCP/IP path is executed frequently, it may be desirable to pack the involved functions as tightly as possible. The resulting increase in code-density can improve i-cache, TLB, and paging behavior. The longer cloning is delayed, the more information is available to specialize the cloned functions. For example, if cloning is delayed until a TCP/IP connection is established, most connection state will remain constant and can be used to partially evaluate the cloned function. This achieves benefits similar to those of code synthesis [17]. Just as with inlining, cloning is at odds with locality of reference. Cloning at connection creation time leads to one cloned copy per connection, while cloning at protocol stack creation time requires only one copy per protocol stack. By choosing the point at which cloning is performed, it is possible to trade off locality of reference against the amount of specialization that can be applied.
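
As a simple illustration of such specialization (the function and fields are hypothetical, not taken from the actual TCP/IP code), consider a routine that fills in header fields from the connection state. A clone generated after the connection is established can fold the now-constant fields into immediates:

  #include <stdint.h>

  struct conn { uint16_t src_port, dst_port; };   /* hypothetical state */

  /* Generic version: loads the port numbers from the connection state. */
  void fill_ports(const struct conn *c, uint16_t *hdr)
  {
      hdr[0] = c->src_port;
      hdr[1] = c->dst_port;
  }

  /* Clone specialized at connection-establishment time for a connection
   * with src_port 1234 and dst_port 80: the loads from the connection
   * state disappear and the values become immediates. */
  void fill_ports_1234_80(uint16_t *hdr)
  {
      hdr[0] = 1234;
      hdr[1] = 80;
  }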

Figure 2: Effects of Outlining and Cloning

Cloning can be considered the next logical step after outlining: outlining improves (dynamic) instruction density within a function, while cloning achieves the same across functions. Figure 2 summarizes the effects that outlining and cloning have on the i-cache footprint. The left column shows many small i-cache gaps due to infrequently executed code. As shown in the middle column, outlining compresses frequently executed code and moves everything else to the end of the function. The right column shows that cloning leads to a contiguous layout for clone A and clone B. Note that this particular example assumes that the clones and the original functions can share the outlined code. Whether this is possible depends on architectural details. For the Alpha, sharing is normally possible. Where sharing is not possible, cloning places a copy of the outlined code behind all the frequently executed code.

We implemented runtime cloning as a means to allow flexible experimentation with various function positioning algorithms. Cloning currently occurs when the system is booted (not when a connection is established) and supports only very simple code specialization. Code specialization is specific to the Alpha architecture and is targeted at reducing function call overheads. In particular, under certain circumstances, the Alpha calling convention allows us to skip the first few instructions in the function prologue. Similarly, if a caller and callee are spatially close, it is possible to replace a jump to an absolute address with a PC-relative branch. This typically avoids the load instruction required to load the address of the callee's entry point and also improves branch-prediction accuracy.

We experimented extensively with different layout strategies for cloned code. We thought that, ideally, it should be possible to avoid all i-cache conflicts along a critical path of execution. With a direct-mapped i-cache, the starting address of a function determines exactly which i-cache blocks it is going to occupy [19]. Consequently, by choosing appropriate addresses, it is possible to optimize i-cache behavior for a given path. The cost is that this fine-grained control of function placement occasionally makes it necessary to introduce gaps between two consecutive functions (sometimes it is possible to fill a gap with another function of the appropriate length). Gaps have the obvious cost of occupying main memory without being of any direct use. More subtly, if i-cache blocks are larger than one instruction, fetching the last instructions in a function will frequently result in part of a gap being loaded into the i-cache as well, thereby wasting memory bandwidth.

We devised a tool employing simple heuristics that, based on a trace file, computes a layout that minimizes replacement misses without introducing too many additional gaps. We call this approach micro-positioning because function placement is controlled down to the size of an individual instruction. I-cache simulation results were encouraging---it was possible to reduce replacement misses by an order of magnitude (from 40 down to 4), while introducing only four or five new cold misses due to gaps.

However, when performing end-to-end measurements, a much simpler layout strategy consistently outperformed the micro-positioning approach. The simpler strategy achieves what we call a bipartite layout. Cloned functions are divided into two classes: path functions that are executed once per path invocation and library functions that are executed multiple times per path invocation. There is very little benefit in keeping path functions in the cache after they have executed, as there is no temporal locality unless the entire path fits into the i-cache. In contrast, library functions should be kept cached from their first through their last invocation. Based on these considerations, it makes sense to partition the i-cache into a path partition and a library partition. Within a partition, functions are placed in the order in which they are called. Such a sequential layout maximizes the effectiveness of any prefetching hardware that may be present. This layout strategy is so simple that it can be computed easily at runtime---the only dynamic information required is the order in which the functions are invoked. In essence, computing a bipartite layout consists of applying the well-known ``closest-is-best'' strategy to the library and path partitions individually [26].
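
The following sketch shows how such a bipartite layout can be computed; the data structures are assumed for illustration and do not reflect the actual implementation. Functions invoked more than once per path execution go into the library partition, all others into the path partition, and each partition is filled in first-invocation order:

  /* Assumed, simplified trace information: one entry per cloned function,
   * listed in the order of first invocation. */
  struct func {
      const char *name;
      unsigned    size;     /* code size in bytes */
      unsigned    calls;    /* invocations per path execution */
      unsigned    offset;   /* assigned offset within its partition */
      int         is_lib;   /* 1 = library partition, 0 = path partition */
  };

  /* Assign offsets within the path and library partitions, preserving
   * first-invocation order within each partition. */
  void bipartite_layout(struct func *f, int n)
  {
      unsigned path_off = 0, lib_off = 0;

      for (int i = 0; i < n; i++) {
          f[i].is_lib = (f[i].calls > 1);
          if (f[i].is_lib) {
              f[i].offset = lib_off;
              lib_off += f[i].size;
          } else {
              f[i].offset = path_off;
              path_off += f[i].size;
          }
      }
  }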

Establishing the performance advantage of the bipartite layout relative to the micro-positioning approach is difficult, since small changes to the heuristics of the latter approach resulted in large performance variations. The micro-positioning approach usually performed somewhat worse than a bipartite layout and sometimes almost equally well, but never better. It is not entirely obvious why this is so, and it is impossible to draw any definite conclusions without even more fine-grained simulations, but we have three hypotheses. First, micro-positioning leads to a non-sequential memory access pattern because a cloned function is positioned wherever it fits best, that is, where it incurs the minimum number of replacement misses. It may be this nearly random access pattern that causes the overall slowdown. Second, the gaps introduced by the micro-positioning approach do cost extra memory bandwidth. This hypothesis is corroborated by the fact that we have not found a single instance where aligning function entry-points or similar gap-introducing techniques would have improved end-to-end latency. Note that this is in stark contrast with the findings published in [11], where i-cache optimization focused on functions with a very high degree of locality. So it may be that micro-positioning suffers because of the memory bandwidth wasted on loading gaps. Third, the DEC 3000/600 workstations used in the experiments employ a large second-level cache. It may be the case that the initial (cold) i-cache misses also missed in the second-level cache, whereas i-cache replacement misses are almost guaranteed to result in a second-level cache hit. Thus, it is quite possible that 36 replacement misses were cheaper than the four or five additional cold misses introduced by micro-positioning.

Despite the unexpected outcome, the above result is encouraging. In order to improve i-cache performance, it is not necessary to compute an optimal layout---a simple layout strategy such as the bipartite layout appears to be just as good (or even better) at a fraction of the cost. We would like to emphasize that the bipartite layout strategy may not be appropriate if all the path and library functions fit into the i-cache. If it is likely that the path will remain cached between subsequent path invocations, it is better to use a simple linear allocation scheme that allocates functions strictly in the order of invocation, that is, without making any distinction between library and path functions. This is a recurrent theme for cache-oriented optimizations: the best approach for a problem that fits into the cache is often radically different from the best one for a problem that exceeds the cache size.

3.3 Path-Inlining


The third latency reducing technique is path-inlining. This is an aggressive form of inlining in which the entire latency-sensitive path of execution is inlined to form a single function. Since the resulting code is specific to a single path, this is warranted only if the path is executed frequently. It is important to limit inlining to path functions: by definition, library functions are used multiple times during a single path execution, so it is better to preserve the locality of reference that they afford. Also, inlining library functions would likely lead to an excessive growth in code size.

The advantage of path-inlining is that it removes almost all call overheads and greatly increases the amount of context available to the compiler for optimization. For example, in the x-kernel's VNET protocol, output processing consists of simply calling the next lower layer's output function. With path-inlining, the compiler can trivially detect and eliminate such useless call overheads.
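
The VNET case can be sketched as follows; the structure and function names are hypothetical and do not correspond to the actual x-kernel interfaces. Before path-inlining, the output routine is nothing but an indirect call to the next lower layer; once the whole path is inlined into a single function, the compiler sees the pass-through directly and can eliminate it.

  struct msg;                           /* opaque message handle */

  struct proto {
      struct proto *down;               /* next lower protocol */
      int (*output)(struct proto *self, struct msg *m);
  };

  /* Before path-inlining: a pure pass-through to the next lower layer. */
  int vnet_output(struct proto *self, struct msg *m)
  {
      return self->down->output(self->down, m);
  }

  /* After path-inlining, a call chain such as
   *     tcp_output() -> vnet_output() -> ip_output()
   * collapses into straight-line code within one function, and the
   * pass-through above disappears entirely. */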

While path-inlining is easy in principle, it is quite difficult in practice. None of the common C compilers are able to inline code across module boundaries (object files). There are tools available that assist in doing so, but the ones that we experimented with were not reliable enough to be of much use. While it should not be very difficult to add cross-module inlining to an existing C compiler, in our case it appeared more effective to apply the required transformations manually.

We applied path-inlining to both the TCP/IP and RPC stacks. In the TCP/IP case, this resulted in collapsing the entire stack into two large functions: one for input processing and one for output processing. Roughly the same applies to the RPC stack, although the split is slightly different: one function takes care of all the processing in the XRPC-TEST, MSELECT, and VCHAN protocols as well as the output processing in CHAN and the protocols below it, whereas the other function handles all input processing up to the CHAN protocol.

3.4 Support for Paths


The latter two techniques described in this section---cloning and path-inlining---depend on knowing the exact sequence of functions that a given packet is going to traverse. While this sequence is usually known for outbound packets, knowing the path an incoming packet is going to follow is problematic. This is because traditional networking code discovers the path of execution incrementally and as part of other protocol processing: a protocol's header contains the identifier of the next higher-level protocol, and this identifier is then mapped into the address of the function that implements the appropriate protocol processing. In other words, processing incoming packets involves indirect function calls that cannot be inlined in general.
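
A sketch of this traditional, incremental demultiplexing (with hypothetical handler names) looks roughly as follows: the next-protocol identifier is read from the header and turned into an indirect call, so the remainder of the path is not known until that point.

  #include <stdint.h>

  struct msg;                           /* opaque message handle */

  typedef int (*input_fn)(struct msg *m);

  static int tcp_input(struct msg *m) { (void)m; return 0; }   /* stub */
  static int udp_input(struct msg *m) { (void)m; return 0; }   /* stub */

  /* IP input processing: map the next-protocol identifier found in the
   * header onto the next handler and call it indirectly. */
  int ip_input(struct msg *m, uint8_t next_proto)
  {
      input_fn next;

      switch (next_proto) {
      case 6:  next = tcp_input; break;         /* TCP */
      case 17: next = udp_input; break;         /* UDP */
      default: return -1;                       /* unknown protocol */
      }
      return next(m);                   /* indirect call: not inlinable */
  }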

To make cloning and path-inlining work for this important case, it is necessary to assume that a packet will follow a given path, generate path-inlined and cloned code for that assumed path, and then, at run-time, establish that an incoming packet really will follow the assumed path. We have designed and implemented a new OS, called Scout, that extends the x-kernel by adding an explicit path abstraction [21]. An essential component of Scout is a packet classifier that determines which path a given packet will traverse [2, 18, 33, 9]. Thus, such a system provides exactly the information necessary to use cloning and path-inlining.
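
With a classifier in place, the dispatch can be sketched as follows (all names are hypothetical): the packet is classified once, and only if it matches the path for which specialized code was generated is the path-inlined clone invoked; otherwise processing falls back to the generic, incremental code.

  struct msg;                                    /* opaque message handle */

  extern int classify(const struct msg *m);      /* returns a path id */
  extern int tcp_ip_path_input(struct msg *m);   /* path-inlined clone */
  extern int generic_input(struct msg *m);       /* incremental demux */

  #define PATH_TCP_IP 1

  int dispatch_input(struct msg *m)
  {
      if (classify(m) == PATH_TCP_IP)
          return tcp_ip_path_input(m);           /* assumed path verified */
      return generic_input(m);                   /* fall back */
  }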

