This section describes four techniques to reduce protocol-processing latency. Unlike other optimizations that improve execution speed by reducing the number of instructions executed, these techniques are primarily targeted at reducing the average cost of each instruction. That is, they attempt to reduce the average number of cycles per instruction (CPI). The first three techniques rely on the path abstraction, whereas the fourth technique is a purely compiler-based technique that is also aimed at improving predictability of execution time.
As the name suggests, outlining is the opposite of inlining. It exploits the fact that not all basic blocks in a subroutine (function) are executed with equal frequency. For example, error handling in the form of a kernel panic is clearly expected to be a low-frequency event. Unfortunately, it is rarely possible for a compiler to detect such cases based only on compile-time information. In general, basic blocks are generated simply in the order of the corresponding source code lines. For example, the sample C source code shown on the left is often translated to machine code of the form shown on the right:
The machine code on the right-hand side is suboptimal for two reasons: (1) it requires a taken jump to skip the error handling code, and (2) it introduces a gap in the i-cache if the cache block size is larger than one instruction. A taken jump often results in pipeline stalls, and i-cache gaps waste memory bandwidth because useless instructions are loaded into the cache. This can be avoided by moving error handling code out of the main line of execution, that is, by outlining the error handling code. Such outlined code could, for example, be moved to the end of the function or to the end of the program, where it does not interfere with any frequently executed code.
        :                                    :
    if (bad_case) {                      load        r0,(bad_case)
        panic ("bad day");               jump_if_0   r0,good_day
    }                                    load_addr   a0,"bad day"
    printf ("good day");                 call        panic
        :                            good_day:
                                         load_addr   a0,"good day"
                                         call        printf
                                             :
Outlining is sometimes used with profile-based optimizers [47, 77]. With profile information, outlining can be automated relatively easily. However, for the purpose of experimentation, it is more appropriate to use a language-based approach that gives full and direct control to the programmer. Thus, we modified the GNU C compiler so that if-statements can be annotated with a static prediction as to whether the if-conditional will mostly evaluate to true or false. For annotated if-statements, the machine code of the unlikely branch is generated at the end of the function; un-annotated if-statements are translated as usual. With this compiler extension, the code on the left is translated into the machine code on the right:
        :                                    :
    if (bad_case @ 0) {                  load          r0,(bad_case)
        panic("bad day");                jump_if_not_0 r0,bad_day
    }                                    load_addr     a0,"good day"
    printf("good day");                  call          printf
        :                            continue:
                                             :
                                         return_from_function
                                     bad_day:
                                         load_addr     a0,"bad day"
                                         call          panic
                                         jump          continue
Note that the if-conditional is followed by an @0 annotation. This tells the modified compiler that the expression is expected to evaluate to false most of the time. In contrast, the annotation @1 would mark a mostly-true expression. For portability and readability, such annotations are normally hidden in C pre-processor macros called PREDICT_FALSE and PREDICT_TRUE.
The above machine code avoids the taken jump and the i-cache gap at the cost of an additional jump in the infrequent case. Corresponding code is generated for if-statements with an else-branch. In that case, the static number of jumps remains the same, however. It is also possible to use such annotations to direct the compiler's optimizer. For example, it would be reasonable to give outlined code lower priority during register allocation. The present implementation does not exploit this option.
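For illustration, the PREDICT_TRUE and PREDICT_FALSE wrappers mentioned above might look roughly as follows. This is only a sketch: the feature-test macro used to detect the annotation-aware compiler is hypothetical, and other compilers simply see the unannotated expression.

    /* Hypothetical wrappers; __ANNOTATING_GCC__ is an assumed feature-test
       macro identifying the modified compiler. */
    #ifdef __ANNOTATING_GCC__
    #  define PREDICT_TRUE(expr)   ((expr) @ 1)   /* mostly true  */
    #  define PREDICT_FALSE(expr)  ((expr) @ 0)   /* mostly false */
    #else
    #  define PREDICT_TRUE(expr)   (expr)
    #  define PREDICT_FALSE(expr)  (expr)
    #endif

    if (PREDICT_FALSE(bad_case)) {
        panic("bad day");
    }
    printf("good day");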
Outlining should not be applied overly aggressively, as otherwise the reduced locality and the additional jumps caused by the execution of outlined basic blocks will negate all potential for performance improvement. In practice, the following three cases appear to be good candidates for outlining:
The performance results reported in Section 4.3 will show that outlining alone does not make a tremendous difference in end-to-end processing latency. However, the dynamic code density improvements that it can achieve are essential to the effectiveness of the next two techniques: cloning and path-inlining.
Cloning involves creating a copy of a function. The cloned copy can be relocated to a more appropriate address and/or optimized for a particular use. For frequently executed paths, it is generally desirable to pack the involved functions as tightly as possible since the resulting increase in code-density can improve i-cache, TLB, and paging behavior. The later cloning is applied in the lifetime of a system, the more information is available to specialize the cloned functions. For example, if cloning is delayed until a TCP/IP path is established, most connection state will remain constant and can be used to partially evaluate the cloned function. This achieves similar benefits as code synthesis [60].
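As a rough illustration of this kind of partial evaluation, consider an output routine that consults per-connection state on every call; in a clone created after connection setup, that state is constant and can be folded into the code. All names, fields, and constants below are hypothetical:

    #include <stdint.h>

    struct conn { uint8_t ttl; uint8_t proto; int may_fragment; };
    struct msg  { uint8_t ttl; uint8_t proto; };

    void send_one(struct msg *m);            /* hypothetical lower layers */
    void fragment_and_send(struct msg *m);

    /* Generic version: reads the connection state on every send. */
    void output_generic(const struct conn *c, struct msg *m)
    {
        m->ttl   = c->ttl;
        m->proto = c->proto;
        if (c->may_fragment)
            fragment_and_send(m);
        else
            send_one(m);
    }

    /* Clone specialized once the connection is established and, say,
       ttl == 64, proto == 6 (TCP), and fragmentation is known to be
       impossible: the loads and the test disappear. */
    void output_clone(const struct conn *c, struct msg *m)
    {
        (void)c;
        m->ttl   = 64;
        m->proto = 6;
        send_one(m);
    }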
Just like inlining, cloning is at odds with locality of reference. Cloning at connection creation time will lead to one cloned copy per connection, while cloning at module graph creation time requires only one copy per protocol stack. By choosing the point at which cloning is performed, it is possible to trade off locality of reference against the amount of specialization that can be applied.
Cloning can be considered the next logical step following outlining---outlining improves (dynamic) instruction density within a function, while cloning achieves the same across functions. Figure 23 summarizes the effects that outlining and cloning have on the i-cache footprint. The leftmost column shows many small i-cache gaps due to infrequently executed code. As shown in the middle column, outlining compresses frequently executed code and moves everything else to the end of the function. The rightmost column shows that cloning leads to a contiguous layout for clone A and clone B. Note that this particular example assumes that the clones and the original functions can share the outlined code. Whether this is possible depends on architectural details. For the Alpha, sharing is normally possible. Where sharing is not possible, cloning places a copy of the outlined code behind all frequently executed code.
Cloning has been implemented as a means to allow flexible experimentation with various function positioning algorithms. For this purpose, it is most appropriate to perform cloning at system boot time instead of at build time and to limit specialization to simple code improvements. The supported code specializations are specific to the Alpha architecture and targeted at reducing function call overhead. In particular, under certain circumstances, the Alpha calling convention allows skipping the first few instructions in the function prologue. Similarly, if a caller and callee are spatially close, it is possible to replace a jump to an absolute address with a PC-relative branch. This typically avoids the load instruction needed to obtain the address of the callee's entry point and also improves branch-prediction accuracy.
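The mechanical part of boot-time cloning might look roughly like the sketch below. It assumes each function's code size is known (e.g., from the symbol table) and omits the Alpha-specific fixups (prologue skipping, branch rewriting) as well as the i-cache flush a real implementation needs; all names are hypothetical.

    #include <stddef.h>
    #include <string.h>

    /* Copy a function's machine code to *dst, packing the functions of a
       frequently executed path into one contiguous region.  Returns the
       relocated entry point and advances *dst past the copied code. */
    static void *clone_function(char **dst, const void *fn, size_t code_size)
    {
        void *entry = *dst;

        memcpy(*dst, fn, code_size);
        /* ... apply the call-overhead specializations here: skip the
           prologue instructions the calling convention permits, turn
           absolute jumps to nearby callees into PC-relative branches ... */
        *dst += code_size;
        return entry;
    }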
Extensive experiments were performed with different layout strategies for cloned code. The idea was that, ideally, it should be possible to avoid all i-cache conflicts along a critical path of execution. With a direct-mapped i-cache, the starting address of a function determines exactly which i-cache blocks it is going to occupy [62]. Consequently, by choosing appropriate addresses, it is possible to optimize i-cache behavior for the sequence of functions in a given path. The cost is that this fine-grained control of function-placement occasionally makes it necessary to introduce gaps between two consecutive functions. Gaps have the obvious cost of occupying main memory without being of any direct use. More subtly, if i-cache blocks are larger than one instruction, fetching the last instructions in a function frequently results in part of a gap being loaded into the i-cache as well, thereby wasting memory bandwidth.
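To make this concrete, the set of blocks a function occupies follows directly from its start address and size, as in the sketch below. The cache parameters are illustrative only, roughly matching the 8 KB, 32-byte-block on-chip i-cache of the Alpha 21064; two functions on the same path conflict exactly when their block ranges overlap modulo NUM_BLOCKS, and micro-positioning picks start addresses so that they do not.

    #define ICACHE_SIZE  (8 * 1024)     /* illustrative parameters */
    #define BLOCK_SIZE   32
    #define NUM_BLOCKS   (ICACHE_SIZE / BLOCK_SIZE)

    /* First i-cache block that a function starting at `addr' maps to. */
    #define FIRST_BLOCK(addr)        (((addr) / BLOCK_SIZE) % NUM_BLOCKS)

    /* Number of blocks touched by a function of `size' bytes at `addr'. */
    #define BLOCKS_TOUCHED(addr, size) \
        ((((addr) % BLOCK_SIZE) + (size) + BLOCK_SIZE - 1) / BLOCK_SIZE)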
A tool employing simple heuristics was devised that, based on a trace file, computes a layout that minimizes replacement misses without introducing too many additional gaps. This approach can be called micro-positioning because function placement is controlled down to the size of an individual instruction. I-cache simulation results were encouraging---the tool made it possible to reduce replacement misses for the TCP/IP path by an order of magnitude (from 40 down to 4), while introducing only four or five new cold misses due to gaps.
However, when performing end-to-end measurements, a much simpler layout strategy consistently outperformed the micro-positioning approach. The simpler strategy achieves a bipartite layout. Cloned functions are divided into two classes: path functions that are executed once per path invocation and library functions that are executed multiple times per path invocation. There is very little benefit in keeping path functions in the cache after they have executed, as there is no temporal locality unless the entire path fits into the i-cache. In contrast, library functions should be kept cached between their first and last invocation from a path. Based on these considerations, it makes sense to partition the i-cache into a path partition and a library partition. Within a partition, functions are placed in the order in which they are called. Such a sequential layout maximizes the effectiveness of any prefetching hardware that may be present. This layout strategy is so simple that it can be computed easily at runtime---the only dynamic information required is the order in which the functions are invoked. In essence, computing a bipartite layout consists of applying the well-known closest-is-best strategy to the library and path partitions individually [77].
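The sketch below shows how such a layout might be computed at runtime from the invocation order. The structure and function names are hypothetical, and address assignment is simplified (no alignment or overflow handling):

    #include <stddef.h>
    #include <stdint.h>

    struct func {
        size_t    size;       /* code size in bytes                    */
        int       calls;      /* invocations during one path execution */
        int       placed;
        uintptr_t new_addr;   /* address assigned by the layout        */
    };

    /* `trace' lists the functions in order of first invocation.  Functions
       called once go into the path partition, functions called repeatedly
       into the library partition; within each partition the layout is
       sequential (closest-is-best). */
    static void layout_bipartite(struct func **trace, int n,
                                 uintptr_t path_base, uintptr_t lib_base)
    {
        for (int i = 0; i < n; i++) {
            struct func *f = trace[i];

            if (f->placed)
                continue;
            f->placed = 1;
            if (f->calls == 1) {
                f->new_addr = path_base;
                path_base  += f->size;
            } else {
                f->new_addr = lib_base;
                lib_base   += f->size;
            }
        }
    }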
Establishing the performance advantage of the bipartite layout relative to the micro-positioning approach is difficult, since small changes to the heuristics of the latter resulted in large performance variations. The micro-positioning approach usually performed somewhat worse than a bipartite layout and sometimes almost equally well, but never better. It is not entirely obvious why this is so, and it is impossible to draw definitive conclusions without even more fine-grained simulations, but we have three hypotheses. First, micro-positioning leads to a non-sequential memory access pattern because a cloned function is positioned wherever it fits best, that is, where it incurs the minimum number of replacement misses. It may be this nearly random access pattern that causes the overall slowdown. Second, the gaps introduced by the micro-positioning approach cost extra memory bandwidth. This hypothesis is corroborated by the fact that we have not found a single instance where aligning function entry-points or similar gap-introducing techniques would have improved end-to-end latency. Note that this is in stark contrast with the findings published in [43], where i-cache optimization focused on functions with a very high degree of locality. So it may be that micro-positioning suffers because of the memory bandwidth wasted on loading gaps. Third, the DEC 3000/600 workstations used in the experiments employ a large second-level cache. It may be the case that cold i-cache misses also missed in the second-level cache, whereas i-cache replacement misses are almost guaranteed to result in a second-level cache hit. Thus, it is quite possible that the 36 additional replacement misses incurred by the bipartite layout were cheaper than the four or five additional cold misses introduced by micro-positioning.
Despite the unexpected outcome, the above result is encouraging. It is not necessary to compute an optimal layout to improve i-cache performance---a simple layout strategy such as the bipartite layout appears to be just as good (or even better) at a fraction of the cost. It must be emphasized that the bipartite layout strategy may not be appropriate if all the path and library functions fit into the i-cache. If it is likely that the path will remain cached between subsequent path invocations, it is better to use a simple linear allocation scheme that allocates functions strictly in the order of invocation, that is, without making any distinction between library and path functions. This is a recurrent theme for cache-oriented optimizations: the best approach for a problem that fits into the cache is often radically different from the best one for a problem that exceeds the cache size.
The third latency reducing technique is path-inlining. This is an aggressive form of inlining where the entire latency-sensitive path of execution is inlined to form a single function. Consider that a path consists of a sequence of stages. The interfaces in the stages imply the entire processing that a message undergoes as it traverses a path. If it were possible to know the sequence of stages in a path at system build time, it would be possible to prepare optimized code for that path simply by compiling the functions encountered in the interfaces as a single function. One way to achieve this is to let the system designer choose what sequences of modules are important, pre-generate code for those important sequences, and then, at runtime, replace the modular code of a path with the optimized code. This can be realized using a path transformation rule (see Section 2.2.3.3) whose guard checks whether or not the sequence of modules in a path matches the sequence of modules for which a path-inlined function was generated. If so, the function pointers in the interfaces can be replaced with pointers to the path-inlined version.
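A sketch of such a rule is shown below: the guard compares the path's module sequence with the sequence for which the inlined code was generated and, on a match, patches the interface function pointer. The structures, module names, and function names are hypothetical.

    #include <string.h>

    struct stage {
        const char *module;                          /* module name     */
        int (*deliver)(struct stage *, void *msg);   /* interface entry */
    };
    struct path { struct stage *stages; int nstages; };

    /* Module sequence the path-inlined function was generated for. */
    static const char *inlined_seq[] = { "TCP", "IP", "VNET", "ETH" };
    #define INLINED_LEN  4

    static int guard_and_install(struct path *p,
                                 int (*inlined)(struct stage *, void *))
    {
        if (p->nstages != INLINED_LEN)
            return 0;
        for (int i = 0; i < INLINED_LEN; i++)        /* guard */
            if (strcmp(p->stages[i].module, inlined_seq[i]) != 0)
                return 0;
        p->stages[0].deliver = inlined;              /* install inlined code */
        return 1;
    }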
Since path-inlining results in code that is specific to a particular sequence of modules, it is warranted only if the resulting code-path is executed frequently. It is also important to avoid inlining library functions: by definition, library functions are called multiple times during a single path execution, so it is better to preserve the locality of reference that they afford. In addition, inlining library functions would likely lead to an excessive growth in code size.
The advantage of path-inlining is that it removes almost all call overhead and greatly increases the amount of context available to the compiler for optimization. For example, in the VNET module, output processing consists of simply calling the next lower layer's output function. With path-inlining, the compiler can trivially detect and eliminate such useless call overhead.
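The effect can be illustrated with a pass-through routine of the kind found in VNET. In the modular code the indirect call survives; in the path-inlined function, the body is inlined into its caller, the indirect call is resolved to the known next stage, and the pass-through costs nothing. Names are hypothetical.

    struct stage {
        struct stage *next;
        int (*output)(struct stage *, void *msg);
    };

    /* Modular version: VNET output merely forwards to the next lower layer. */
    static int vnet_output(struct stage *self, void *msg)
    {
        return self->next->output(self->next, msg);   /* pure pass-through */
    }

    int eth_output(struct stage *self, void *msg);     /* hypothetical driver */

    /* Path-inlined version: the compiler sees the concrete callee and can
       eliminate the indirection (and often the call) entirely. */
    static int path_output(struct stage *eth_stage, void *msg)
    {
        return eth_output(eth_stage, msg);
    }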
While path-inlining is simple in principle, realizing it in practice is quite difficult. None of the commonly available C compilers is able to inline code across module boundaries (object files) and through indirect function calls. Note that inlining through indirect function calls is made possible by information available from the path that is being optimized for; in essence, this is how routing decisions are frozen into the code-path. There are tools that assist in cross-module inlining, but the ones available at the time were not reliable enough to be of much use. While it should not be difficult to add path-information-assisted cross-module inlining to an existing C compiler, the path-inlined versions for the TCP/IP and RPC paths were obtained by manually combining code from different modules and then using gcc's normal inlining facility to produce the final code. For TCP/IP, path-inlining resulted in two large functions: one for input processing and one for output processing. Roughly the same applies to the RPC stack, although the split is slightly different: one function takes care of all the processing in protocols XRPCTEST, MSELECT, and VCHAN as well as the output processing in CHAN and the protocols below it, whereas the other function handles all input processing up to the CHAN protocol.
The fourth and final technique is called last call optimization. It does not exploit paths but instead addresses an issue that frequently arises during path execution, namely deeply nested call chains. Often, the last action of a partial processing function in a path stage is to call the processing function of the next stage in the path. This means that each stage in a path requires additional stack space to store the call frame of the stage's processing function. This is often wasteful: if the call frame of a stage is not accessible to more deeply nested call frames, it could be deallocated before calling the next stage.
The last call optimization recognizes such opportunities and avoids wasting stack space by deallocating stack frames as early as possible. Ideally, in a calling sequence that is nested N levels deep, the amount of stack space used would be just the maximum frame size among the active functions instead of the sum of the frame sizes. The last call optimization is an old technique (e.g., [33]) that is popular with compilers for functional languages, but not with compilers for imperative languages such as C. The main problem with imperative languages is that it is difficult to decide whether or not a callee has access to a caller's stack frame. A simple solution is to apply the last call optimization only if none of the local (automatic) variables have had their addresses taken. This is conservative but sufficiently accurate to uncover many last call opportunities.
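In C terms, the conservative rule distinguishes the following two cases; the stage and message names are made up, but the distinction is exactly the one described above.

    struct msg { int len; };

    int next_stage_output(struct msg *m);              /* hypothetical */
    int next_stage_output_opts(struct msg *m, int *o); /* hypothetical */

    /* No local has its address taken and the call is the last action:
       the caller's frame can be released before the call. */
    int stage_output(struct msg *m)
    {
        int hdrlen = 20;
        m->len += hdrlen;
        return next_stage_output(m);
    }

    /* &opt escapes to the callee, which might access this frame, so the
       last call optimization must be suppressed. */
    int stage_output_opts(struct msg *m)
    {
        int opt = 0;
        return next_stage_output_opts(m, &opt);
    }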
A trial implementation of the last call optimization in gcc showed improvements of up to 30% for heavily recursive functions. Unfortunately, for the TCP/IP and RPC stacks, it did not achieve a measurable improvement. There are several likely reasons for this. First, about half of the available last call opportunities were destroyed by the reference counting scheme present in the Scout prototype. Specifically, there were several occasions where the optimization could have been applied had it not been for the few instructions required to decrement a reference count after returning from a deeply nested function call.