
4.4 Concluding Remarks


This chapter introduced four latency-reducing techniques: one that can take advantage of paths (outlining), two that critically depend on paths (cloning and path-inlining), and a fourth (last call optimization) that targets the overhead of the deep call chains common to path execution. The fourth technique failed to measurably improve performance in the test environment, but it is likely to be beneficial under different circumstances, as discussed in Section 4.2.4. The three path-based techniques achieve two kinds of benefits: first, they improve execution speed by reducing protocol-processing latency and, second, they improve the predictability of path execution time by providing explicit control over the layout of critical-path code.
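To recall the flavor of these transformations, the following minimal C sketch illustrates the idea behind outlining. It is not taken from the Scout sources; the function names and the GCC-specific noinline/cold attributes used to keep the error path out of line are illustrative assumptions. The point is simply that rarely executed code (here, the handling of a malformed header) is moved out of the fall-through instruction stream, so the common case occupies a dense, contiguous run of instructions in the i-cache.

    /*
     * Illustration only: a hypothetical rendition of outlining, not code
     * from the Scout prototype.  The rarely executed error path is moved
     * into a separate, non-inlined "cold" function so that the common
     * case occupies a dense, contiguous run of instructions.
     */
    #include <stddef.h>

    struct packet {
        size_t         len;
        unsigned char *data;
    };

    /* Cold path: kept out of the fall-through instruction stream. */
    static void __attribute__((noinline, cold))
    handle_bad_packet(struct packet *p)
    {
        /* ... count, log, and drop the malformed packet ... */
        (void)p;
    }

    /* Hot path: the common case falls straight through. */
    int process_packet(struct packet *p)
    {
        if (p->len < 20) {            /* rare: truncated header       */
            handle_bad_packet(p);     /* branch to outlined cold code */
            return -1;
        }
        /* ... common-case protocol processing continues here ... */
        return 0;
    }

With this arrangement, a straight-line run through process_packet touches only hot instructions; the cold code is fetched into the i-cache only when an error actually occurs.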

The path-based techniques reduce processing latency by improving the memory system behavior of the code path, i.e., by reducing the mCPI. Fundamentally, this can be achieved by (a) increasing the dynamic instruction stream density, (b) reducing the number of cache conflicts, and (c) reducing the critical-path code size. Each of these components is addressed by one or more of the techniques presented. Since the gap between processor and memory speed continues to widen, the techniques are likely to become even more important in the future. For example, this study was conducted on a machine with a 175 MHz dual-issue Alpha processor and a 100 MB/s memory system. Now, systems with 600 MHz quadruple-issue processors and 200 MB/s memory systems are readily available; in other words, the peak execution rate increased by almost a factor of seven, yet the memory bandwidth increased by only a factor of two.
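Using only the numbers quoted above, the widening gap can be made explicit by comparing peak instruction-issue rates with memory bandwidth:

    \[
      \frac{600\,\mathrm{MHz} \times 4\ \text{issues}}{175\,\mathrm{MHz} \times 2\ \text{issues}} \approx 6.9
      \qquad\text{versus}\qquad
      \frac{200\,\mathrm{MB/s}}{100\,\mathrm{MB/s}} = 2 .
    \]

The peak execution rate thus grew nearly seven-fold while the memory bandwidth merely doubled, so every cache miss avoided by these techniques is worth correspondingly more processor cycles.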

Predictability is often just as important as raw performance. The proposed techniques address this issue by reducing fluctuations in execution speed. The BAD case reported in Section 4.3 demonstrates that an uncontrolled i-cache layout can have a profound effect on latency. Even though that case was constructed artificially, suboptimal layouts are possible and not uncommon in practice. For example, the measured mCPI for the DEC UNIX v3.2c TCP/IP stack is 2.3, significantly worse than the 1.58 mCPI measured for the standard version of the Scout prototype. The proposed techniques do not guarantee optimal behavior, but they do avoid such pathological i-cache layouts.
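To gauge what such a difference in mCPI means, assume, purely for illustration, that memory stall cycles simply add to the execution component of the cycles-per-instruction count:

    \[
      \mathit{CPI}_{\mathrm{total}} = \mathit{CPI}_{\mathrm{exec}} + \mathit{mCPI} .
    \]

Under this assumption, the gap of 2.3 - 1.58 = 0.72 cycles per instruction is paid on every instruction of the critical path, regardless of how efficient the instruction stream itself is.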

