
4.3 Evaluation

This section evaluates the three path-based techniques presented in the previous section. It first describes the specific test cases and then presents end-to-end latency results and a detailed, trace-based analysis of processing behavior. Throughput measurements are not presented since they would simply confirm that none of the techniques had a measurably negative effect on throughput. In fact, as is commonly the case, the latency improvements also resulted in a slightly higher throughput.

4.3.1 Test Cases

All measurements are for the environment described in Section 4.1.1. Both TCP/IP and RPC are measured in several configurations that enable gauging the effect of each technique. An exhaustive measurement of all possible combinations would have been impractical, so we focus on the following six versions, supplying additional data as necessary:

STD: the standard Scout prototype version of the protocol stacks, with none of the path-based techniques applied.
OUT: the STD code with outlining applied.
CLO: the OUT code with cloning and a bipartite layout of path and library code.
PIN: the OUT code with path-inlining applied.
ALL: all three techniques combined (outlining, cloning, and path-inlining).
BAD: the same code as CLO, but laid out in memory so as to produce poor cache behavior.

Note that all versions except STD use path-specific code. Thus, it is necessary to look up the appropriate path for incoming packets, either by using PathIDs [56] or a packet classifier (such as the one presented in Section 3.4.1). To keep this study independent of the quality of the classifier, which may well be assisted by hardware, and independent of whether PathIDs are used, the path lookup costs are excluded from the results presented in the remainder of this chapter.
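For illustration only, the excluded lookup step amounts to mapping packet header fields to a previously established path. The sketch below is a hypothetical, minimal version of such a demultiplexing table; the field names and the Path placeholder are illustrative and are not the Scout classifier interface.

    # Hypothetical sketch of a path lookup for incoming packets. The key fields
    # and the Path placeholder are illustrative, not the actual Scout interface.

    class Path:
        """Placeholder for a path: a pre-established chain of processing stages."""
        def __init__(self, name):
            self.name = name

    # Demultiplexing table: maps a flow identifier extracted from the packet
    # headers to the path that should process packets of that flow.
    path_table = {
        # (protocol, local port, remote IP, remote port) -> Path
        ("tcp", 5001, "192.168.1.2", 4711): Path("tcp/ip-path-0"),
    }

    def classify(hdr):
        """Return the path for a packet, or None if no path exists (slow path)."""
        key = (hdr["proto"], hdr["dst_port"], hdr["src_ip"], hdr["src_port"])
        return path_table.get(key)

It is this per-packet lookup cost, however implemented, that is factored out of all results below.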

4.3.2 End-to-End Results

TCP and RPC latency was measured by ping-ponging packets with no payload between a server and a client machine. Since TCP is stream-oriented, it does not send any network packets unless there is data to be sent. Thus, the no-payload case is approximated by sending one byte of data per message. For both protocol stacks, the tests result in 64-byte frames on the wire since that is the minimum frame size for Ethernet. The reported end-to-end latency is the average time it took to complete one roundtrip in a test involving 100,000 roundtrips. Time was measured with a clock running at 1024Hz, thus yielding roughly a 1ms resolution.
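The following sketch shows the structure of such a measurement loop and why the coarse clock suffices (send, recv, and read_clock are placeholders for the actual test harness, not Scout interfaces): averaged over 100,000 roundtrips, the roughly 1ms clock quantization contributes only about 0.01µs of error to the reported mean.

    # Sketch of the latency test; send/recv/read_clock stand in for the harness.
    ROUNDTRIPS = 100000
    CLOCK_HZ = 1024          # clock ticks at 1024 Hz, i.e. ~0.98 ms resolution

    def mean_roundtrip_us(send, recv, read_clock):
        start = read_clock()                 # in ticks of the 1024 Hz clock
        for _ in range(ROUNDTRIPS):
            send(b"x")                       # one byte of payload -> 64-byte frame
            recv()                           # wait for the echoed reply
        ticks = read_clock() - start
        seconds = ticks / CLOCK_HZ
        return seconds / ROUNDTRIPS * 1e6    # mean roundtrip time in microseconds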

For the TCP/IP stack, the optimizations were applied to both the server and client side. Since the processing on the server and client side is almost identical, the improvement on each side is simply half of the end-to-end improvement. For the RPC stack, the optimizations were restricted to the client side. On the server side, the configuration yielding the best performance was used in all measurements (which happened to be the ALL version). Always running the same RPC server ensures that the reference point remains fixed and allows a meaningful analysis of client performance.

                  TCP/IP                        RPC
Version    Te [µs]         Δ [%]        Te [µs]         Δ [%]
BAD        498.8 ± 0.29    +60.5        457.1 ± 0.20    +25.1
STD        351.0 ± 0.28    +12.9        399.2 ± 0.29     +9.2
OUT        336.1 ± 0.37     +8.1        394.6 ± 0.10     +8.0
CLO        325.5 ± 0.07     +4.7        383.1 ± 0.20     +4.8
PIN        317.1 ± 0.03     +2.0        367.3 ± 0.19     +0.5
ALL        310.8 ± 0.27     +0.0        365.5 ± 0.26     +0.0

Table 6: Roundtrip Latency

Table 6 shows the end-to-end results. The rows are sorted by decreasing latency, with each row giving the performance of one version of the TCP/IP and RPC stacks. The performance is reported in absolute terms as the mean roundtrip time plus/minus one standard deviation, and in relative terms as the percent slowdown relative to the fastest version (ALL). For TCP/IP, the mean and standard deviation were computed from ten samples; five samples were collected for RPC.

As the table shows, the BAD version of the TCP/IP stack performs by far the worst. At almost 500µs per roundtrip, it is over 173µs slower than version CLO, which corresponds to a slowdown of more than 53%. As explained above, the code in the two versions is essentially identical; the only significant difference is its layout in memory. This clearly shows that i-cache effects can have a profound impact on end-to-end latency.

Row STD shows that the regular Scout prototype version of the protocol stacks has much better cache behavior than BAD. Compared to the best case, that version is slower by about 12.9% for TCP/IP and 9.2% for RPC. There are two reasons why STD performs relatively well. First, earlier experiences with direct-mapped caches led to attempts to improve cache performance by manually changing the order in which functions appear within the object files and by changing the link order. Because of such manual tuning, the STD version has reasonably good cache behavior to begin with. Second, the function usage pattern in the Scout prototype appears to be such that laying out functions in the address space in what essentially amounts to a random manner yields average performance that is closer to the best case than to the worst case. This is especially true since, in the latency-sensitive case, there are few loops with the potential for pathological cache behavior. Keep in mind, however, that case BAD is possible in practice unless the cache layout is controlled explicitly. The techniques proposed in this chapter provide sufficient control to avoid such a bad layout.

Row OUT indicates that outlining works quite well for TCP/IP---it reduces roundtrip time by about 15µs compared to STD. Since both the client and the server use outlining, the reduction on the client side is roughly half of the end-to-end reduction, or 7.5µs. Outlining works less well for RPC but is still able to achieve a 4.6µs reduction. This behavior can be explained by the fact that TCP consists of a few large functions that handle most of the protocol processing (including connection establishment, tear-down, and packet retransmission), whereas RPC consists of many small functions that often handle exceptional events through separate functions. In this sense, the RPC code is already structured in a way that handles exceptional events outside the performance critical code path. Nevertheless, outlining does result in significantly improved performance for both protocol stacks.

In contrast, row CLO indicates that cloning works better for RPC than for TCP. In the former case, the reduction on the client side is about 11.5µs, whereas in the latter case the client-side reduction is roughly 5.3µs. This makes sense since TCP/IP concentrates most of its instruction locality in a few big functions, meaning that there are few opportunities for self-interference. The many-small-functions structure of the RPC stack makes it likely that the uncontrolled layout present in version OUT leads to unnecessary replacement misses. Conversely, this means that there are good opportunities for cloning to improve cache effectiveness.

Path-inlining also appears to work well for the RPC stack. Since PIN is the same as version OUT with path-inlining enabled, it is more meaningful to compare it to the outlined version (OUT) than to the next-best version (CLO). If we do so, we find that the TCP/IP client-side latency is about 9.5µs, and the RPC client-side latency about 27.3µs, below the corresponding value in row OUT. Again, this is consistent with the fact that the RPC stack contains many more---and typically much smaller---functions than TCP. Just eliminating call overhead through inlining improves the performance of the RPC stack significantly.

Finally, row ALL shows the roundtrip latency of the version with all optimizations applied. As expected, it is indeed the fastest version. However, the client-side reduction for TCP/IP compared to PIN is only about 3.1µs and the improvement in the RPC case is a meager 1.8µs. That is, with path-inlined code, partitioning library and path functions does not increase performance much further.

While the end-to-end latency improvements are certainly respectable, they are nevertheless modest relative to the total roundtrip time on the given test system. It is important to keep in mind, however, that modern high-performance network adapters have much lower latency than the LANCE Ethernet adapter present in the DEC 3000 system [3]. To put this into perspective, consider that a minimum-sized Ethernet packet is 64 bytes long, to which an 8-byte preamble is added. At the speed of a 10Mbps Ethernet, transmitting the frame takes 57.6µs. This is compounded by the relative tardiness of the LANCE controller itself: we measured 105µs between the point where a frame is passed to the controller and the point where the transmission-complete interrupt handler is invoked. The LANCE overhead of 47.4µs is consistent with the 51µs figure reported elsewhere for the same controller in an older-generation workstation [104]. Since the latency between sending the frame and the receive interrupt on the destination system is likely to be higher, and since each roundtrip involves two message transmissions, we can safely subtract 105µs × 2 = 210µs from the end-to-end latency to get an estimate of the actual processing time involved. For example, if we apply this correction to the TCP/IP stack, we find that version BAD is actually 186% slower than the fastest version. Even version STD is still 40% slower than version ALL.
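The adjustment is simple arithmetic and can be reproduced directly from the figures above and in Table 6; the short sketch below is only an illustration of that calculation, not an additional measurement.

    # Reproduce the network/controller adjustment described above.
    FRAME_BITS = (64 + 8) * 8            # minimum frame plus preamble
    WIRE_TIME_US = FRAME_BITS / 10.0     # 10 Mbps Ethernet -> 57.6 us on the wire
    LANCE_SEND_US = 105.0                # frame handed to LANCE -> tx-complete interrupt
    LANCE_OVERHEAD_US = LANCE_SEND_US - WIRE_TIME_US   # 47.4 us controller overhead
    ADJUSTMENT_US = 2 * LANCE_SEND_US    # two transmissions per roundtrip -> 210 us

    def adjusted(te_us, te_best_us):
        """Adjusted latency and slowdown relative to the fastest version."""
        tp, tp_best = te_us - ADJUSTMENT_US, te_best_us - ADJUSTMENT_US
        return tp, 100.0 * (tp / tp_best - 1.0)

    # TCP/IP BAD and STD versus ALL (Te values from Table 6):
    print(adjusted(498.8, 310.8))   # -> (288.8, ~186%), as quoted above
    print(adjusted(351.0, 310.8))   # -> (141.0, ~40%)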

Table 7 revisits the end-to-end latency numbers, adjusted to factor out the overhead imposed by the controller and the Ethernet. While there will obviously be some additional latency, with a modern adapter one should expect this per-roundtrip network and controller overhead to be on the order of 50µs rather than the 210µs measured on our experimental platform.

                  TCP/IP                    RPC
Version    Te [µs]     Δ [%]        Te [µs]     Δ [%]
BAD         288.8     +186.5         247.1      +59.0
STD         141.0      +40.2         189.2      +21.7
OUT         126.1      +25.1         184.6      +18.7
CLO         115.5      +14.6         173.1      +11.3
PIN         107.1       +6.3         157.3       +1.2
ALL         100.8       +0.0         155.5       +0.0

Table 7: Roundtrip Latency Adjusted for Network and Controller

4.3.3 Detailed Analysis

The end-to-end results are useful for establishing overall performance effects, but since some of the protocol processing can be overlapped with network I/O, they are not directly related to CPU utilization. Also, it is impossible to control all performance parameters simultaneously. For example, the tests did not explicitly control data-cache performance. Similarly, there are other sources of variability: the memory free-list, for example, is likely to vary from test case to test case (e.g., due to different memory allocation patterns at boot time). While not all of these effects can be controlled, most can be measured.

Towards this end, we collected two additional data sets. The first is a set of instruction traces that cover most of the protocol processing. The second is a set of fine-grained measurements of the execution time of the traced code. The instruction traces do not cover all of the processing since the tracing facility did not allow the tracing of interrupt handling, but other than that, the traces are complete.

4.3.3.1 Cache Statistics

Using the execution traces and a simulator of the DEC 3000/600 memory hierarchy, it is possible to compute the cache statistics presented in Table 8. It lists the i-cache, d-cache, and b-cache performance as the number of misses in the cache (column Miss), the total number of accesses to the cache (column Acc), and the number of replacement misses (column Repl). Note that the middle three columns represent a combination of the d-cache and write-buffer performance, since the d-cache is used only on the read path and the write-buffer is used only on the write path. The write-buffer in the 21064 CPU performs write-merging, so a merged write is counted as a cache hit, whereas a write that results in a write to the b-cache is counted as a cache miss.

The rightmost column in the table shows that, except for the BAD versions, none of the tests cause replacement misses in the b-cache. Since the entire kernel is small enough to fit into the b-cache, this means that all code executes out of the b-cache unless there are conflicts with data accesses performed outside of the traced code.

                     i-cache              d-cache/wr-buffer           b-cache
                Miss   Acc   Repl      Miss   Acc   Repl      Miss   Acc   Repl
TCP/IP  BAD      700  4718    224       459  1862     31       863  1390    110
        STD      586  4750     72       492  1845     56       800  1286      0
        OUT      547  4728     69       462  1841     40       731  1183      0
        CLO      483  4684     27       455  1862     34       678  1074      0
        PIN      484  4245     66       406  1668     27       630  1015      0
        ALL      414  4215     10       401  1682     28       596   913      0
RPC     BAD      721  4253    176       556  1663     19       995  1544     14
        STD      590  4291     31       547  1635     14      1004  1379      0
        OUT      542  4257     26       556  1629     19       951  1313      0
        CLO      488  4227      7       547  1664     13       845  1213      0
        PIN      402  3471     14       453  1310     19       694   972      0
        ALL      374  3468      0       450  1330     13       662   931      0

Table 8: Cache Performance. Miss: number of accesses that missed in the cache. Acc: Total number of cache accesses. Repl: Number of replacement misses.

A more important observation is that the cache simulations confirm that cloning with a bipartite layout does indeed help avoid i-cache replacement misses. For example, applying cloning to version OUT reduces the number of i-cache replacement misses in the TCP/IP stack from 69 to 27. Interestingly, path-inlining alone does not get rid of many replacement misses. The table shows that the PIN version still suffers from 66 such misses. This is because, for PIN, there is nothing that prevents library code from clashing with path code.

The RPC case is analogous to TCP/IP except that the reductions in replacement misses are even larger: compared to version OUT, cloning alone causes a reduction by a factor of 3.7 and, together with path-inlining, not a single replacement miss remains.
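To illustrate how statistics such as those in Table 8 are derived from an instruction trace, the following is a minimal model of a direct-mapped cache that counts accesses, misses, and replacement misses (misses to blocks that were resident earlier but have since been evicted). It is only a sketch: the actual simulator models the full DEC 3000/600 memory hierarchy, whereas the parameters below simply assume an 8KB direct-mapped cache with 32-byte lines, the geometry of the 21064 primary i-cache.

    # Minimal direct-mapped cache model: counts accesses, misses, and
    # replacement misses.  Default geometry: 8 KB, 32-byte lines (21064 i-cache).

    class DirectMappedCache:
        def __init__(self, size=8 * 1024, line_size=32):
            self.num_lines = size // line_size
            self.line_size = line_size
            self.lines = [None] * self.num_lines   # block number per line; None = invalid
            self.seen = set()                      # blocks that were cached at some point
            self.accesses = self.misses = self.repl_misses = 0

        def access(self, addr):
            self.accesses += 1
            block = addr // self.line_size
            index = block % self.num_lines
            if self.lines[index] == block:
                return True                        # hit
            self.misses += 1
            if block in self.seen:
                self.repl_misses += 1              # block was resident earlier: replacement miss
            self.lines[index] = block
            self.seen.add(block)
            return False

    # Feeding the instruction addresses of a trace to such a model yields the
    # Miss/Acc/Repl columns of Table 8 for the i-cache.
    icache = DirectMappedCache()
    for pc in (0x1000, 0x1004, 0x3000, 0x1000):   # toy instruction-address "trace"
        icache.access(pc)
    print(icache.accesses, icache.misses, icache.repl_misses)   # -> 4 3 1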

4.3.3.2 Processing Time Measurements

Given the execution traces and timings, it is possible to derive a number of interesting quantities: protocol processing time, CPI, and the memory CPI (mCPI)---the average number of cycles that each instruction was stalled due to a memory access. Protocol processing time can be measured using the CPU cycle counter. From this, the CPI can be derived by dividing the processing time (expressed in cycles) by the length of the trace (in instructions). The memory CPI can then be calculated by subtracting the instruction CPI (iCPI). The iCPI is the average number of cycles spent on each instruction assuming a perfect memory system, so the relationship CPI = iCPI + mCPI holds. The iCPI can be derived from the instruction trace by feeding it into a CPU simulator that assumes a perfect memory system. The simulator used for this study is somewhat crude with respect to branches, as it simply adds a fixed penalty for each taken branch, but is otherwise almost perfectly accurate.

                       TCP/IP                                  RPC
Version   Tp [µs]        Length  iCPI  mCPI     Tp [µs]        Length  iCPI  mCPI
BAD       167.0 ± 1.75     4718  1.61  4.58     154.2 ± 0.47     4253  1.69  4.66
STD        89.6 ± 0.34     4750  1.72  1.58      85.1 ± 0.53     4291  1.78  1.69
OUT        84.1 ± 0.12     4728  1.61  1.50      81.0 ± 0.16     4257  1.68  1.65
CLO        77.2 ± 0.36     4684  1.61  1.28      71.0 ± 0.29     4227  1.69  1.25
PIN        69.9 ± 0.48     4245  1.57  1.31      57.7 ± 0.18     3471  1.66  1.25
ALL        66.1 ± 0.48     4215  1.57  1.17      49.2 ± 0.12     3468  1.67  0.81

Table 9: Protocol Processing Costs

The results obtained when applying these derivations are shown in Table 9. The Tp columns show the measured processing time in microseconds. As before, this is reported as the sample mean plus/minus the sample standard deviation. The columns labelled Length give the trace length in instructions. Columns iCPI and mCPI give the instruction and memory CPI values, respectively.
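These quantities can be cross-checked directly. Assuming the 175MHz clock of the DEC 3000/600's 21064 CPU (the clock rate is an assumption here, not stated above), the sketch below recovers the mCPI value for the STD version of the TCP/IP stack from its measured processing time, trace length, and iCPI:

    # Derive CPI and mCPI from measured processing time and trace length,
    # assuming a 175 MHz CPU clock (DEC 3000/600).
    CLOCK_MHZ = 175.0

    def cpi_breakdown(tp_us, trace_len, icpi):
        cycles = tp_us * CLOCK_MHZ     # processing time in CPU cycles
        cpi = cycles / trace_len       # total cycles per instruction
        mcpi = cpi - icpi              # memory stall cycles per instruction
        return cpi, mcpi

    # STD version, TCP/IP (Table 9): Tp = 89.6 us, 4750 instructions, iCPI = 1.72.
    print(cpi_breakdown(89.6, 4750, 1.72))   # -> (about 3.30, about 1.58)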

Looking at the iCPI columns, we find that both the TCP/IP and RPC stack break down into three classes: the standard version has the largest iCPI, the versions using outlining (BAD, OUT, CLO) have the second largest value, and the path-inlined versions have the smallest value. This is expected since the code within each class is largely identical. Since the CPU simulator adds a fixed penalty for each taken branch, the decrease in iCPI when going from the standard version to the outlined versions corresponds directly to the reduction in taken branches. Interestingly, outlining improves iCPI by almost exactly 0.1 cycles for both protocol stacks. This is a surprisingly large reduction considering that path-inlining achieves a reduction of 0.04 cycles at the most. We expected that the increased context available in the inlined versions would allow the compiler to improve instruction scheduling more. Since this does not seem to be the case, the performance improvement due to path-inlining stems mostly from a reduction in the critical-path code size. Note that even in the best case, the iCPI value is still above 1.5. While some of this can be attributed to suboptimal code generation on the part of gcc 2.6.0, a more fundamental reason for this large value is the structure of systems code: the traces show that there is very little actual computation, but much interpretation and branching.

The mCPI columns in the table show that, except for the RPC case of ALL, the CPU spends well above one cycle per instruction waiting for memory (on average). Comparing the mCPI values for the various versions, we find that the proposed techniques are rather effective. Both protocol stacks achieve a reduction by a factor of more than 3.9 when going from version BAD to version ALL. Even when comparing version ALL to STD, we find that the latter has an mCPI that is more than 35% larger. In terms of mCPI reduction, cloning with a bipartite layout and path-inlining are about equally effective. The former is slightly more effective for the TCP/IP stack, but in the RPC case both achieve a reduction of 0.4 cycles per instruction. Combining the two techniques does have some synergistic effects for TCP/IP, but the additional reduction compared to cloning or path-inlining alone is small, on the order of 0.11 to 0.14 cycles per instruction. Of all the mCPI values, the value 0.81 for the ALL version of the RPC stack clearly stands out. Additional measurements that are not reported here lead us to believe that this value is an anomaly: small changes to the code resulted in mCPI values more in line with the other results. This serves as a reminder that while the proposed techniques improve cache behavior, it is nearly impossible to achieve perfect control over any reasonably complex system.

4.3.3.3 Performance Improvement Comparison

It is now possible to compare the end-to-end results with the processing execution times and the trace-based cache simulation results. First, we would like to verify that the outlining and cloning improvements are really due primarily to i-cache effects rather than to (uncontrolled) d-cache effects. The fact that the mCPI values are much larger than zero indicates that the memory system is the bottleneck. As all versions run out of the b-cache (except for the BAD versions), the processing time improvement is dominated by changes in the number of b-cache accesses (column ΔNb in Table 10). Thus, we would like to know the percentage I of the b-cache access reduction that is due to the i-cache, as opposed to the d-cache/write-buffer. If the number of b-cache accesses is reduced purely due to the i-cache, this percentage is 100%. If the reduction is purely due to the d-cache/write-buffer, it is 0%. A value greater than 100% indicates that d-cache/write-buffer performance got worse, but that the i-cache was able to compensate for the losses so that, overall, the number of b-cache accesses was still reduced.

                             TCP/IP                                       RPC
              I [%]  ΔTe [µs]  ΔTp [µs]  ΔNb  ΔNm      I [%]  ΔTe [µs]  ΔTp [µs]  ΔNb  ΔNm
BAD -> CLO      97      86.7      89.8   316  110        99      74.0      83.2   331   14
STD -> OUT     114       7.4       5.5   103    0        71       4.6       4.1    66    0
OUT -> CLO      91       5.3       6.9   109    0        94      11.5      10.0   100    0
OUT -> PIN      70       9.5      14.2   168    0        67      27.3      23.3   341    0
PIN -> ALL      93       3.2       3.8   102    0        95       1.8       8.5    41    0

Table 10: Comparison of Latency Improvement

Table 10 lists this percentage for both the TCP/IP and RPC protocol stack in the respective columns labelled I. Note that in all but one case more than 90% of the b-cache access reductions obtained through outlining and cloning are due to the i-cache. The exception is where outlining is applied to the standard Scout prototype version of the RPC stack. As shown in row STD->OUT, the i-cache can take credit for only 71% of the reduction in b-cache accesses, but in all other cases, d-cache effects did not lead to significantly overestimating the benefit of a technique.

It is also interesting to compare the end-to-end latency improvements (ΔTe) with the improvements in processing time (ΔTp) as reported in Table 10. The data shows that the improvements are generally consistent with each other. The end-to-end improvement may be bigger than the processing time improvement if portions of the untraced code executed faster as well. The converse can occur if, for example, most of the improvement occurs in a section of the code whose execution overlaps network communication. The only place where these figures deviate significantly from each other is in the RPC case, when going from the path-inlined version (PIN) to the version that was also cloned (ALL). In this case, the processing time improvement is 8.5µs, but the observed end-to-end improvement is only 1.8µs. This is the same case that caused the 0.81 mCPI reported in Table 9 and, as explained previously, is most likely an anomaly.

Finally, it is possible to cross-check whether the time improvements are consistent with the reductions in the number of b-cache accesses. If we divide the processing time improvements (ΔTp) in the second through fifth rows by the difference in the number of b-cache accesses (ΔNb), we get an average b-cache latency in the range from 5.6 to 17.5 cycles. These values appear reasonable considering that a b-cache access takes 10 cycles to complete. It would be unrealistic to expect exactly 10 cycles per miss since this simple model ignores many of the finer aspects of the DEC 3000's memory system. Also, the same reasoning cannot be applied to the BAD -> CLO improvements since the number of b-cache blocks that remained cached across multiple path invocations was not measured. Nevertheless, the table includes the data for the sake of completeness. Note that the PIN -> ALL change in the RPC stack yields an 8.5µs processing time improvement, but the difference in the number of b-cache accesses is only 41. This confirms that the anomalous improvement cannot be due to the decrease in the number of b-cache accesses since, otherwise, that would imply that each access cost on the order of 36 cycles---almost four times the theoretical latency.

4.3.3.4 Outlining Effectiveness

The results presented so far are somewhat misleading in that they underestimate the benefits of outlining. While it achieves performance improvements in and of itself, it is also important as an enabling technology for path-inlining and cloning. Thanks to outlining, the amount of code replication due to path-inlining and cloning is greatly reduced. This size reduction is accompanied by an increase in dynamic instruction density; that is, less memory bandwidth is wasted loading instructions that are never executed. This can be demonstrated quantitatively with the results presented in Table 11. It shows that without outlining (STD version), just over 20% of the instructions loaded into the cache never get executed. For both protocol stacks, outlining reduces this by roughly a factor of 1.4. It is instructive to consider that this waste corresponds to roughly 1.8 unused instructions per eight instructions (one cache block). Outlining reduces this to about 1.3 unused instructions. This is a respectable improvement, especially considering that outlining was applied conservatively.

            Without Outlining           With Outlining
          i-cache unused    Size      i-cache unused    Size
TCP/IP         21%          5841           15%          3856
RPC            22%          5085           16%          3641

Table 11: Outlining Effectiveness

The table also shows that outlining results in impressive critical-path code-size reductions. The Size columns show the static code size (in number of instructions) of the latency critical path before and after outlining. In the TCP/IP stack, about 1985 instructions could be outlined, corresponding to 34% of the code. Note that this is almost a full primary i-cache worth of instructions (8KB). Similarly, in the RPC stack 28% of the 5085 instructions could be outlined. This reinforces the claim that outlining is a useful technique not only because of its direct benefits, but also as a means to greatly improve cloning and path-inlining effectiveness. Minimizing the size of the path code improves cloning flexibility and increases the likelihood that the entire path will fit into the cache.
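The figures in Table 11 can be recomputed directly; the sketch below is only an illustration of the arithmetic (the eight-instructions-per-block figure follows from 32-byte cache blocks and 4-byte instructions):

    # Recompute the outlining-effectiveness figures discussed above.
    INSTR_PER_BLOCK = 8     # 32-byte i-cache block / 4-byte instructions

    def outlining_stats(size_before, size_after, unused_before, unused_after):
        outlined = size_before - size_after
        return {
            "outlined instructions": outlined,
            "fraction outlined": outlined / size_before,                  # e.g. 34% for TCP/IP
            "unused per block before": unused_before * INSTR_PER_BLOCK,   # roughly 1.7-1.8
            "unused per block after": unused_after * INSTR_PER_BLOCK,     # roughly 1.2-1.3
        }

    print(outlining_stats(5841, 3856, 0.21, 0.15))   # TCP/IP row of Table 11
    print(outlining_stats(5085, 3641, 0.22, 0.16))   # RPC row of Table 11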

