4 Evaluation

This section evaluates the techniques presented in the previous section. It first describes the test cases and then presents end-to-end latency results, followed by a detailed, trace-based analysis of processing behavior. For brevity, no throughput measurements are presented, but we note that none of the techniques had a negative effect on throughput. In fact, as is commonly the case, the latency improvements also resulted in slightly higher throughput.

4.1 Test Cases

Recall that all measurements were performed in the environment described in Section 2.1. Both the TCP/IP and RPC stacks were measured in several configurations, selected to allow us to gauge the effect of each technique. An exhaustive measurement of all possible combinations would have been impractical, so we focus on the following six versions and supply additional data where needed.

Note that path-inlining and cloning require running a packet classifier on incoming packets, since the optimized code is no longer general enough to handle all possible packets. Currently, the best software classifiers add an overhead of about 1-4 micro-s per packet [2, 9]. The results presented in this section do not include the time required to classify packets. This seems justified since systems commonly use classifiers anyway to improve functionality, flexibility, and accountability [16, 21]. In a system that already makes use of a classifier, the benefits of cloning and path-inlining are exactly those reported below. The same applies to a system that uses a hardware classifier or a protocol stack that contains an explicit path identifier.
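For concreteness, the kind of classifier assumed here simply maps a few header fields to a path identifier. The following sketch is purely illustrative; the field names, the table, and the path identifiers are invented, not taken from the cited systems or the x-kernel sources:

    /* Minimal sketch of a software packet classifier: map the header
     * fields that select a path to a path id, and fall back to the
     * general-purpose code when no specialized path applies. */
    #include <stdint.h>

    enum { PATH_GENERIC = 0, PATH_TCP_CLIENT, PATH_RPC_CLIENT };

    struct pkt_key {
        uint8_t  protocol;            /* e.g., TCP or UDP */
        uint16_t src_port, dst_port;
    };

    struct path_entry {
        struct pkt_key key;
        int path_id;
    };

    /* A small, linearly searched table; real classifiers use hashing or
     * tries to stay within the 1-4 micro-s budget quoted above. */
    static struct path_entry path_table[16];
    static int path_table_len;

    int classify(const struct pkt_key *k)
    {
        for (int i = 0; i < path_table_len; i++) {
            const struct pkt_key *t = &path_table[i].key;
            if (t->protocol == k->protocol &&
                t->src_port == k->src_port &&
                t->dst_port == k->dst_port)
                return path_table[i].path_id;  /* run cloned/inlined path */
        }
        return PATH_GENERIC;                   /* generic protocol code */
    }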

4.2 End-to-End Results

TCP and RPC latency was measured by ping-ponging packets with no payload between a server and a client machine. Since TCP is stream-oriented, it does not send any network packets unless there is data to be sent; thus, the ``no payload'' case is approximated by sending 1 byte of data per message. For both protocol stacks, the tests result in 64-byte frames on the wire, since that is the minimum frame size for Ethernet. The reported end-to-end latency is the average time it took to complete one roundtrip in a test involving 100,000 roundtrips. Time was measured with a clock running at 1024 Hz, yielding roughly 1 ms resolution.
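The measurement amounts to a simple ping-pong loop on the client. The sketch below conveys the structure only; it uses the BSD socket API and gettimeofday() purely for illustration, whereas the actual tests run over the x-kernel stacks and use the clock described above:

    /* Ping-pong a 1-byte message 100,000 times and report the average
     * roundtrip time in micro-seconds.  Error handling is omitted. */
    #include <sys/time.h>
    #include <unistd.h>

    #define ROUNDTRIPS 100000

    double measure_roundtrip(int sock)
    {
        char buf[1] = { 0 };
        struct timeval start, end;

        gettimeofday(&start, NULL);
        for (int i = 0; i < ROUNDTRIPS; i++) {
            write(sock, buf, 1);   /* 1 B payload -> 64-byte frame */
            read(sock, buf, 1);    /* wait for the echoed byte */
        }
        gettimeofday(&end, NULL);

        return ((end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec)) / ROUNDTRIPS;
    }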

For the TCP/IP stack, the optimizations were applied to both the server and client side. Since the processing on the server and client side is almost identical, the improvement on each side is simply half of the end-to-end improvement. For the RPC stack, the optimizations were restricted to the client side. On the server side, the configuration yielding the best performance was used in all measurements (which happened to be the ALL version). Always running the same RPC server ensures that the reference point remains fixed and allows a meaningful analysis of client performance.

                  TCP/IP                        RPC
Version    Te [micro-s]    Delta [%]    Te [micro-s]    Delta [%]
BAD        498.8 ± 0.29      +60.5      457.1 ± 0.20      +25.1
STD        351.0 ± 0.28      +12.9      399.2 ± 0.29       +9.2
OUT        336.1 ± 0.37       +8.1      394.6 ± 0.10       +8.0
CLO        325.5 ± 0.07       +4.7      383.1 ± 0.20       +4.8
PIN        317.1 ± 0.03       +2.0      367.3 ± 0.19       +0.5
ALL        310.8 ± 0.27       +0.0      365.5 ± 0.26       +0.0

Table 2: End-to-end Roundtrip Latency

Table 2 shows the end-to-end results. The rows are sorted by decreasing latency, with each row giving the performance of one version of the TCP/IP and RPC stacks. Performance is reported in absolute terms as the mean roundtrip time plus/minus one standard deviation, and in relative terms as the percent slowdown relative to the fastest version (ALL). For TCP/IP, the mean and standard deviation were computed from ten samples; five samples were collected for RPC.

As the table shows, the BAD version of the TCP/IP stack performs by far the worst. At almost 500 micro-s per roundtrip, it is over 173 micro-s slower than version CLO, which corresponds to a slowdown of more than 53%. As explained above, the code in the two versions is mostly identical; the only significant difference is the layout of that code. This clearly shows that i-cache behavior can have a profound effect on end-to-end latency.

Row STD shows that the standard x-kernel version of the protocol stacks has much better cache behavior than BAD. That version is slower than the fastest version by about 12.9% for TCP/IP and 9.2% for RPC. There are two reasons why STD performs relatively well. First, earlier x-kernel experience with direct-mapped caches led to attempts to improve cache performance by manually changing the order in which functions appear within the object files and by changing the link order. Because of such manual tuning, the STD version has reasonably good cache behavior to begin with. Second, the function usage pattern in the x-kernel appears to be such that laying the functions out in the address space in what amounts to a random manner yields an average performance closer to the best case than to the worst case. This is especially true in the latency-sensitive case, where there are few loops with the potential for pathological cache behavior. Keep in mind, however, that a layout as bad as BAD is possible in practice unless the cache layout is controlled explicitly. The techniques proposed in this paper provide sufficient control to avoid such a layout.

Row OUT indicates that outlining works quite well for TCP/IP---it reduces roundtrip time by about 15 micro-s compared to STD. Since both the client and the server use outlining, the reduction on the client side is roughly half of the end-to-end reduction, or 7.5 micro-s. In contrast, at a 4.6 micro-s saving, outlining makes a smaller difference to the RPC stack. This is because TCP consists of a few large functions that handle most of the protocol processing (including connection establishment, teardown, and packet retransmission), whereas RPC consists of many small functions that already handle exceptional events through separate functions. In this sense, the RPC code is already structured so that exceptional events are handled outside the performance-critical code path. Nevertheless, outlining results in significantly improved performance for both protocol stacks.
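To recap the transformation being evaluated: outlining moves rarely executed code out of the fall-through path into separate functions, so that it no longer occupies i-cache blocks on the critical path. The fragment below is a hand-written illustration of the idea, not code from the measured stacks; all names are placeholders:

    #include <stdio.h>
    #include <stdlib.h>

    #define HDR_LEN 20
    struct msg { int len; };

    static void log_error(const char *s) { fprintf(stderr, "%s\n", s); }
    static void free_msg(struct msg *m)  { free(m); }
    static int  process(struct msg *m)   { return m->len; }

    /* Before: the error path is interleaved with the common case and is
     * therefore fetched into the cache along with the hot code. */
    int deliver(struct msg *m)
    {
        if (m->len < HDR_LEN) {
            log_error("short packet");
            free_msg(m);
            return -1;
        }
        return process(m);
    }

    /* After outlining: the rare case becomes a call to a separate
     * function that the linker can place away from the hot path. */
    static int deliver_short_packet(struct msg *m)
    {
        log_error("short packet");
        free_msg(m);
        return -1;
    }

    int deliver_outlined(struct msg *m)
    {
        if (m->len < HDR_LEN)
            return deliver_short_packet(m);
        return process(m);
    }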

In contrast, row CLO indicates that cloning works better for RPC than for TCP/IP. For RPC, the reduction on the client side is about 11.5 micro-s, whereas for TCP/IP the client-side reduction is roughly 5.3 micro-s. This makes sense, since TCP/IP concentrates most of its instructions in a few big functions, meaning that there are few opportunities for self-interference in the cache. The many-small-function structure of the RPC stack makes it likely that the uncontrolled layout present in version OUT leads to unnecessary replacement misses; conversely, this means there are good opportunities for cloning to improve cache effectiveness.
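One way to approximate cloning and the bipartite layout with an ordinary GCC toolchain (rather than the tools used for the measured versions) is to emit each path's clones into a dedicated text section, which a linker script can then lay out contiguously, keeping path code and library code apart. The fragment below is only a sketch of that idea; the section and function names are invented:

    /* Sketch: per-path clone placed in its own section. */
    #define TCP_PATH __attribute__((section(".text.tcp_path"), noinline))

    struct msg { int len; };

    /* Library version, laid out wherever the linker happens to put it. */
    int checksum(struct msg *m) { return m->len != 0; }

    /* Clone dedicated to the TCP path: same body, but compiled into the
     * path's section so that, together with the other path functions, it
     * can be laid out to avoid conflict misses in a direct-mapped cache. */
    TCP_PATH static int checksum_tcp(struct msg *m) { return m->len != 0; }

    TCP_PATH int tcp_path_demux(struct msg *m)
    {
        if (!checksum_tcp(m))
            return -1;
        return m->len;
    }

A linker-script rule that gathers *(.text.tcp_path) into one contiguous region then provides the kind of layout control that the CLO versions rely on.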

Path-inlining also works very well for the RPC stack. Since PIN is version OUT with path-inlining enabled, it is more meaningful to compare it to the outlined version (OUT) than to the next-best version (CLO). Doing so, we find that the TCP/IP client-side latency is about 9.5 micro-s, and the RPC client-side latency about 27.3 micro-s, below the corresponding values in row OUT. Again, this is consistent with the fact that the RPC stack contains many more---and typically much smaller---functions than TCP. Simply eliminating call overheads through inlining improves the performance of the RPC stack significantly.
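Path-inlining, by contrast, collapses the layer-crossing calls of the common path into one function; only the uncommon cases still call into the general-purpose, outlined code. The hand-written fragment below illustrates the shape of the transformation for a two-layer send path; all names and the fragmentation threshold are invented, and the real paths span the full stacks:

    struct msg { int len; };

    int ip_send(struct msg *m);        /* general-purpose layer below */
    int ip_send_slow(struct msg *m);   /* outlined uncommon-case code */

    /* Layered version: one call per protocol layer on the common path. */
    int udp_send_layered(struct msg *m)
    {
        /* ... prepend UDP header ... */
        return ip_send(m);
    }

    /* Path-inlined version: the common-case work of both layers is
     * merged, removing call overhead and letting the compiler schedule
     * across the former layer boundary. */
    int udp_ip_send_inlined(struct msg *m)
    {
        /* ... prepend UDP header ... */
        if (m->len > 1472)             /* would require fragmentation */
            return ip_send_slow(m);
        /* ... prepend IP header, checksum, hand frame to the driver ... */
        return 0;
    }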

Finally, row ALL shows the roundtrip latency of the version with all optimizations applied. As expected, it is indeed the fastest version. However, the client-side reduction for TCP/IP compared to PIN is only about 3.1 micro-s and the improvement in the RPC case is a meager 1.8 micro-s. That is, with path-inlined code, partitioning library and path functions does not increase performance much further.

While the end-to-end latency improvements are certainly respectable, they represent only a modest fraction of the total roundtrip time on the given test system. It is important to keep in mind, however, that modern high-performance network adapters have much lower latency than the LANCE Ethernet adapter present in the DEC 3000 system [1]. To put this into perspective, consider that a minimum-sized Ethernet packet is 64 bytes long, to which an 8-byte preamble is added. At the speed of a 10 Mbps Ethernet, transmitting such a frame takes 57.6 micro-s. This is compounded by the relative tardiness of the LANCE controller itself: we measured 105 micro-s between the point where a frame is passed to the controller and the point where the ``transmission complete'' interrupt handler is invoked. The implied LANCE overhead of 47.4 micro-s is consistent with the 51 micro-s figure reported elsewhere for the same controller in an older-generation workstation [32]. Since the latency between sending the frame and the receive interrupt on the destination system is likely to be even higher, and since each roundtrip involves two message transmissions, we can safely subtract 105 micro-s x 2 = 210 micro-s from the end-to-end latency to estimate the actual processing time involved. For example, applying this correction to the TCP/IP stack, we find that version BAD is actually 186% slower than the fastest version; even version STD is still 40% slower than version ALL.
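As a worked check of these figures (our notation; the wire and controller overheads are assumed to be simply additive):

    T_{wire}     = (64 + 8)\,\mathrm{bytes} \times 8\,\mathrm{bits/byte} \,/\, 10\,\mathrm{Mbps} = 57.6\,\mu\mathrm{s}
    T_{LANCE}    = 105\,\mu\mathrm{s} - 57.6\,\mu\mathrm{s} = 47.4\,\mu\mathrm{s}
    T_{adjusted} = T_{measured} - 2 \times 105\,\mu\mathrm{s}
    \Delta_{BAD} = (498.8 - 210)/(310.8 - 210) - 1 = 288.8/100.8 - 1 \approx 186.5\%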

Table 3 revisits the end-to-end latency numbers, adjusted to factor out the overhead imposed by the controller and the Ethernet. While a modern adapter will obviously still add some latency, one should expect its contribution to the roundtrip time to be on the order of 50 micro-s rather than the 210 micro-s measured on our experimental platform.

                  TCP/IP                        RPC
Version    Te [micro-s]    Delta [%]    Te [micro-s]    Delta [%]
BAD        288.8            +186.5      247.1             +59.0
STD        141.0             +40.2      189.2             +21.7
OUT        126.1             +25.1      184.6             +18.7
CLO        115.5             +14.6      173.1             +11.3
PIN        107.1              +6.3      157.3              +1.2
ALL        100.8              +0.0      155.5              +0.0

Table 3: End-to-end Roundtrip Latency Adjusted for Network Controller

4.3 Detailed Analysis

The end-to-end results are useful for establishing global performance effects, but since some of the protocol processing can be overlapped with network I/O, they are not directly related to CPU utilization. Also, it is impossible to control all performance parameters simultaneously. For example, the tests did not explicitly control data-cache performance. Similarly, there are other sources of variability: the memory free-list, for instance, is likely to vary from test case to test case (e.g., due to different memory allocation patterns at startup time). While not all of these effects can be controlled, most can be measured.

Towards this end, we collected two additional sets of data. The first is a set of instruction traces that cover most of the protocol processing. The second is a set of fine-grained measurements of the execution time of the traced code. The instruction traces do not cover all of the processing, since the tracing facility did not allow tracing of interrupt handling; other than that, the traces are complete. For brevity, we only summarize the most important results; see [22] for a more detailed discussion.

4.3.1 Processing Time Measurements

                          TCP/IP                                     RPC
Version    Tp [micro-s]    Length    iCPI    mCPI    Tp [micro-s]    Length    iCPI    mCPI
BAD        167.0 ± 1.75    4718      1.61    4.58    154.2 ± 0.47    4253      1.69    4.66
STD         89.6 ± 0.34    4750      1.72    1.58     85.1 ± 0.53    4291      1.78    1.69
OUT         84.1 ± 0.12    4728      1.61    1.50     81.0 ± 0.16    4257      1.68    1.65
CLO         77.2 ± 0.36    4684      1.61    1.28     71.0 ± 0.29    4227      1.69    1.25
PIN         69.9 ± 0.48    4245      1.57    1.31     57.7 ± 0.18    3471      1.66    1.25
ALL         66.1 ± 0.48    4215      1.57    1.17     49.2 ± 0.12    3468      1.67    0.81

Table 4: Protocol Processing Costs

Given the execution traces and timings, it is possible to derive a number of interesting quantities: the protocol processing time, the CPI, and, most importantly, the memory CPI (mCPI). Protocol processing time was measured with the CPU's cycle counter; the CPI is thus obtained by dividing the cycle count by the trace length. The memory CPI can then be calculated by subtracting the instruction CPI (iCPI), the average number of cycles spent on each instruction assuming a perfect memory system. That is, the stall cycles counted in the iCPI are purely due to data dependencies and limitations of the CPU. The iCPI was derived by feeding the instruction trace into a CPU simulator. The simulator models branches somewhat crudely, simply adding a fixed penalty for each taken branch; other than that, it is almost perfectly accurate.
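In terms of the quantities reported in Table 4, the derivation amounts to the following (with f_{CPU} the processor clock frequency and N the trace length; the symbols are ours):

    \mathrm{CPI} = \frac{T_p \cdot f_{CPU}}{N}, \qquad \mathrm{mCPI} = \mathrm{CPI} - \mathrm{iCPI}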

The trace-based data is shown in Table 4. The Tp columns show the measured processing time in microseconds, again given as the sample mean plus/minus the sample standard deviation. The Length columns give the trace length as an instruction count, and the iCPI and mCPI columns give the instruction and memory CPI values, respectively.

Looking at the iCPI columns, we find that both the TCP/IP and RPC stacks break down into three classes: the standard version has the largest iCPI, the versions using outlining (BAD, OUT, CLO) have the second largest value, and the path-inlined versions have the smallest value. This is expected since the code within each class is largely identical. Since the CPU simulator adds a fixed penalty for each taken branch, the decrease in the iCPI as we go from the standard version to the outlined versions corresponds directly to the reduction in taken branches. Interestingly, outlining improves iCPI by almost exactly 0.1 cycles for both protocol stacks. This is a surprisingly large reduction considering that path-inlining achieves a reduction of at most 0.04 cycles. We expected that the increased context available in the inlined versions would allow the compiler to improve instruction scheduling further. Since this does not seem to be the case, the performance improvement due to path-inlining stems mostly from a reduction in the critical-path code size. Note that even in the best case, the iCPI value is still above 1.5. While some of this can be attributed to suboptimal code generation on the part of gcc 2.6.0, a more fundamental reason for this large value is the structure of low-level systems code: the traces show that there is very little actual computation, but much interpretation and branching.

The mCPI columns in the table show that, except for the ALL version of the RPC stack, the CPU spends on average well above one cycle per instruction waiting for memory. Comparing the mCPI values for the various versions, we find that the proposed techniques are rather effective. Both protocol stacks achieve a reduction by a factor of more than 3.9 when going from version BAD to version ALL. Even when comparing version ALL to STD, we find that the latter has an mCPI that is more than 35% larger. In terms of mCPI reduction, cloning with a bipartite layout and path-inlining are about equally effective: the former is slightly more effective for the TCP/IP stack, while in the RPC case both achieve a reduction of 0.4 cycles per instruction. Combining the two techniques does have some synergistic effect for TCP/IP, but the additional reduction compared to cloning or path-inlining alone is small, on the order of 0.11 to 0.14 cycles per instruction. Of all the mCPI values, the 0.81 of the ALL version of the RPC stack clearly stands out. Additional data presented in [22] leads us to believe that this value is an anomaly: small changes to the code lead to mCPI values more in line with the other results. This serves as a reminder that while the proposed techniques improve cache behavior, it is nearly impossible to achieve perfect control for any reasonably complex system.

4.3.2 Outlining Effectiveness

The results presented so far are somewhat misleading in that they understate the benefits of outlining. While outlining achieves performance improvements in and of itself, it is more important as an enabling technology for path-inlining and cloning: thanks to outlining, the amount of code replication caused by path-inlining and cloning is greatly reduced. This size reduction is accompanied by an increase in dynamic instruction density, that is, less memory bandwidth is wasted loading instructions that are never executed. Table 5 demonstrates this quantitatively. It shows that without outlining (the STD version), around 20% of the instructions loaded into the cache never get executed. For both protocol stacks, outlining reduces this by roughly a factor of 1.4. It is instructive to consider that a waste of just over 20% corresponds to roughly 1.8 unused instructions per cache block of eight instructions; outlining reduces this to about 1.3 unused instructions. This is a respectable improvement, especially considering that outlining was applied conservatively.
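The per-block figures follow from Table 5, assuming 32-byte i-cache blocks holding eight 4-byte Alpha instructions:

    0.22 \times 8 \approx 1.8 \quad \text{unused instructions per block without outlining}
    0.16 \times 8 \approx 1.3 \quad \text{unused instructions per block with outlining}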

            Without Outlining            With Outlining
            i-cache unused    Size       i-cache unused    Size
TCP/IP      21%               5841       15%               3856
RPC         22%               5085       16%               3641

Table 5: Outlining Effectiveness

The table also shows that outlining yields impressive critical-path code-size reductions. The Size columns show the static code size (in number of instructions) of the latency-critical path before and after outlining. In the TCP/IP stack, about 1985 instructions could be outlined, corresponding to 34% of the code; note that this is almost a full primary i-cache worth of instructions (8KB). Similarly, in the RPC stack 28% of the 5085 instructions could be outlined. This reinforces our claim that outlining is a useful technique not only because of its direct benefits, but also as a means to greatly improve the effectiveness of cloning and path-inlining: minimizing the size of the main-line code improves cloning flexibility and increases the likelihood that the entire path will fit in the cache.
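As a quick consistency check on these figures (assuming 4-byte instructions):

    5841 - 3856 = 1985 \approx 0.34 \times 5841, \qquad 1985 \times 4\,\mathrm{bytes} \approx 7.9\,\mathrm{KB}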

