1.4 Performance Implications of Modular Systems

The benefit of modular systems stems from the fact that each module is developed independently, without any assumptions about the context in which it will be used. This enables system builders to combine modules in ways not anticipated by the programmer, subject to a single constraint: that modules be connected in a compatible and meaningful fashion. But this very benefit is also a disadvantage: many performance optimizations are difficult or impossible to employ in the absence of the full context in which a module is being used. Aside from impacting performance, modularity also makes it difficult to solve problems that require global context, such as guaranteeing a certain quality of service or managing resources optimally.

The alternative to modular systems is to build systems in a vertically integrated fashion. With this approach, a system is programmed from the ground up and tailored exactly to the problem it is designed to solve. By definition, the entire context of the system is known and available at the time it is programmed, so, given enough time and effort, this approach is guaranteed to lead to a system with the best possible performance. On the other hand, since modularity is given up, the approach is complex, time-consuming, and expensive. In other words, it is justifiable only when the lifetime or market of that one particular appliance is large enough that the development costs can be amortized.

It would be interesting to compare the two approaches quantitatively. This is surprisingly difficult, because it is rare for one and the same functionality (product) to be implemented both ways and then compared in an objective and direct manner. There are a few exceptions to this rule, however. Network file servers are commercially important enough that several companies build vertically integrated servers. For example, Network Appliance manufactures what can reasonably be considered vertically integrated file servers, whereas Digital Equipment manufactures relatively modular, UNIX-based servers. By comparing the SPEC file server benchmark [111] results for one file server from each company, we can provide at least one data point that directly quantifies the cost of modularity.

A price/performance ratio would be easiest to interpret, but prices for computer systems are notoriously volatile and pricing strategies vary greatly between companies. Thus, instead of comparing a price/performance ratio, we use the SPEC result together with the amount of hardware required to achieve it as the performance metric. Table 1 lists the SPEC file server benchmark results and hardware configurations for two comparable systems: a Network Appliance F540 and a Digital AlphaServer 2000 4/275. The table shows that, at a response time of around 7.7ms, the F540 delivers more than five times the throughput of the AlphaServer. This is particularly remarkable when comparing the hardware configurations. As the table shows, the AlphaServer has much more raw hardware power than the F540: twice the number of CPUs, twice the amount of second-level cache, four times the memory capacity, and almost twice the number of disks!

To be fair, the performance differential is not entirely due to modularity. Although no quantitative results are available, the fact that UNIX is a general-purpose user environment likely accounts for a good portion of the performance gap. What we can say with confidence, however, is that the above comparison demonstrates that a vertically integrated system can greatly outperform a relatively modular and general system. In the remainder of this section, we provide more direct evidence that modularity can exact a significant performance cost.

                     F540 [97]                 AlphaServer 2000 [98]
SPECnfs_A93          2,230 ops/sec @ 7.7ms     404 ops/sec @ 7.6ms
CPU                  275 MHz 21064A Alpha      275 MHz 21064 Alpha
Number of CPUs       1                         2
Second-level cache   2MB                       4MB
Other cache          8MB NVRAM                 Prestoserve
Memory               256MB                     1024MB
Number of disks      14                        25

Table 1: SPECnfs Results and System Configurations

1.4.1 Potential For Performance Improvements

Since it is rare to find systems that exist in both a modular and a vertically integrated version, we must look to other metrics that help quantify the cost of modularity. A useful metric is the performance improvement that can be achieved by (manually) optimizing the performance of a modular system. The literature contains many examples of this; we now discuss a few.

1.4.1.1 Code Synthesis

Code synthesis, also known as run-time code generation, was used in the Synthesis kernel to optimize code across module boundaries [86, 60]. The two main techniques were factoring invariants and collapsing layers, both forms of partial evaluation. In extreme cases, such as reading a single byte from a memory pseudo-device (/dev/mem in UNIX), these techniques achieved order-of-magnitude improvements compared to regular UNIX kernels [89]. Similar techniques were applied in a later project called Synthetix. While less aggressive, it was more practical in that it applied code synthesis to an existing commercial operating system, namely HP-UX. The results reported in [85] indicate speedups ranging from 1.12 to 3.61 for the UNIX read system call compared to the regular HP-UX version.
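To convey the flavor of these techniques, the C sketch below approximates factoring invariants by checking invariants once at open time and then installing a collapsed read routine through a function pointer. The real Synthesis kernel emitted specialized machine code at run time; all names here (mem_open, mem_read_fast, and so on) are hypothetical.

    /* Illustrative sketch only: Synthesis generated machine code at
     * run time; here, "factoring invariants" is approximated by
     * selecting a specialized routine at open() time instead of
     * re-checking invariants on every read(). */

    #include <stddef.h>
    #include <string.h>
    #include <sys/types.h>      /* ssize_t */

    struct file {
        ssize_t (*read)(struct file *f, void *buf, size_t len);
        char   *base;           /* invariant for /dev/mem: mapping base */
        size_t  offset;
    };

    /* Generic path: revalidates everything on every call. */
    static ssize_t generic_read(struct file *f, void *buf, size_t len)
    {
        if (f == NULL || buf == NULL)
            return -1;
        /* ... device-type dispatch, permission checks, locking ... */
        memcpy(buf, f->base + f->offset, len);
        f->offset += len;
        return (ssize_t)len;
    }

    /* Specialized path: the invariants were checked once at open time
     * and are "baked in", so the per-call checks disappear. */
    static ssize_t mem_read_fast(struct file *f, void *buf, size_t len)
    {
        memcpy(buf, f->base + f->offset, len);
        f->offset += len;
        return (ssize_t)len;
    }

    /* open() factors the invariants and installs the collapsed,
     * specialized read routine. */
    void mem_open(struct file *f, char *mapping)
    {
        f->base   = mapping;
        f->offset = 0;
        f->read   = mem_read_fast;   /* instead of generic_read */
    }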

1.4.1.2 Integrated Layer Processing

The fundamental observation behind Integrated Layer Processing (ILP) [16, 1] is that as a network packet passes through the various protocol processing steps, its data may be traversed multiple times. For example, an Ethernet driver may first copy the data from the network adapter to main memory, then UDP may compute a checksum, and finally a Remote Procedure Call (RPC) protocol may swap the byte order of the data. This is suboptimal: since more or less the same data is accessed multiple times, the per-byte overhead is larger than necessary and, worse, the memory access pattern is poor, with the same data moving from memory to the CPU and back multiple times. A system that uses ILP collapses all data processing into a single loop, so the data is brought into the CPU only once, greatly improving the efficiency of the memory system. Indeed, Abbott and Peterson [1] report communication bandwidth improvements of 10 to 30% due to ILP.
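The following C sketch contrasts the two structures, using a simple additive checksum and a word-wise byte swap as simplified stand-ins for the real protocol operations; it is not taken from [16, 1], and __builtin_bswap32 is a GCC/Clang builtin.

    #include <stddef.h>
    #include <stdint.h>

    /* Layered version: the data is traversed three times. */
    uint32_t layered(uint32_t *dst, const uint32_t *src, size_t nwords)
    {
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i < nwords; i++)        /* driver: copy    */
            dst[i] = src[i];
        for (i = 0; i < nwords; i++)        /* UDP: checksum   */
            sum += dst[i];
        for (i = 0; i < nwords; i++)        /* RPC: byte swap  */
            dst[i] = __builtin_bswap32(dst[i]);
        return sum;
    }

    /* ILP version: a single loop; each word enters the CPU only once. */
    uint32_t integrated(uint32_t *dst, const uint32_t *src, size_t nwords)
    {
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i < nwords; i++) {
            uint32_t w = src[i];            /* single load     */
            sum += w;                       /* checksum        */
            dst[i] = __builtin_bswap32(w);  /* swap and store  */
        }
        return sum;
    }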

1.4.1.3 PathIDs

PathIDs [56] is a mechanism that allows substituting the implementation of a specific network protocol stack with hand-optimized, vertically integrated code. The mechanism essentially involves inserting an additional network header right above the link-layer. This extra header indicates which, if any, optimized code should be used to process an incoming network packet. In a test-implementation, PathIDs helped reduce one-byte UDP latency between a pair of FDDI-connected Alpha workstations running UNIX from 759µs to 578µs; a 23 percent reduction. It should be noted that PathIDs optimize the receive-side of protocol processing only. That is, a large fraction of the 578µs of the optimized time is due to fixed costs such as time on the wire and sender-side processing. In this light, a 23 percent improvement is very significant.
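As a rough illustration of the mechanism, the sketch below demultiplexes incoming packets through a table of optimized handlers keyed by the extra header's path ID. The header layout, the table, and all names are hypothetical, not the actual PathIDs implementation from [56].

    #include <stddef.h>
    #include <stdint.h>

    struct pathid_hdr {
        uint16_t path_id;       /* 0 = no optimized path available */
    };

    typedef void (*pkt_handler)(const uint8_t *pkt, size_t len);

    #define MAX_PATHS 256
    static pkt_handler path_table[MAX_PATHS];  /* filled at path setup */

    void generic_stack_input(const uint8_t *pkt, size_t len); /* slow path */

    /* Link-layer input: one table lookup replaces layer-by-layer
     * demultiplexing for packets that carry a known path ID. */
    void link_input(const uint8_t *pkt, size_t len)
    {
        const struct pathid_hdr *h = (const struct pathid_hdr *)pkt;

        if (len >= sizeof *h && h->path_id < MAX_PATHS
            && path_table[h->path_id] != NULL)
            path_table[h->path_id](pkt + sizeof *h, len - sizeof *h);
        else
            generic_stack_input(pkt, len);   /* unoptimized protocols */
    }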

1.4.1.4 Single-Copy TCP/IP

Banks and Prudence [6] present what amounts to a vertically integrated networking stack. The stack under consideration was a typical UNIX networking stack consisting of a socket layer, TCP and IP layers [83, 82], and a network driver layer. The vertical integration ensured that network data is copied only once on both the outgoing and incoming sides. This involved combining the copy routine with the checksumming routine, changing the socket layer so that outgoing data is placed in appropriately sized chunks of network-adapter memory, changing the network driver so that incoming packets are split into headers and data, and changing TCP to properly handle the delayed acknowledgements that arise because the checksum of a received packet can be computed only when the user-level process is ready to receive the packet's data. Clearly, creating this vertically integrated version was not without difficulties, but the resulting performance improvement was impressive: communication bandwidth increased from about 7,500 to 11,600 kilobytes per second, an improvement of roughly 55 percent.
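The heart of such a scheme is a routine that computes the 16-bit ones'-complement Internet checksum while copying, so the data is traversed once instead of twice. The following C sketch shows one plausible form of such a routine; it is illustrative only, not the code from [6].

    #include <stddef.h>
    #include <stdint.h>

    uint16_t copy_and_checksum(void *dstv, const void *srcv, size_t len)
    {
        uint8_t       *dst = dstv;
        const uint8_t *src = srcv;
        uint32_t       sum = 0;

        while (len >= 2) {              /* copy and sum 16-bit words */
            sum += (uint32_t)(src[0] << 8 | src[1]);
            dst[0] = src[0];
            dst[1] = src[1];
            src += 2;
            dst += 2;
            len -= 2;
        }
        if (len) {                      /* odd trailing byte */
            *dst = *src;
            sum += (uint32_t)(*src << 8);
        }
        while (sum >> 16)               /* fold the carries */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }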

1.4.2 Implications for Quality-of-Service and Predictability

The preceding examples show that modularity can have a tremendous impact on performance. Researchers were able to achieve speedups ranging from twenty to several hundred percent by applying various verticalization techniques to otherwise purely modular systems. But modularity also has a negative effect that cannot be quantified as easily: resource-management problems such as quality-of-service or predictability are often difficult, if not impossible, to solve in purely modular systems. The key issue is that a perfectly reasonable combination of modules can exhibit unwanted behavior even though each module works according to its specification.

For example, consider a simple filter that takes as input a message (a sequence of data bytes) and produces as output the run-length encoded version of that message. Suppose this filter were used as part of a networking stack through which a mix of different packets flows, some of which have realtime constraints associated with them. Since the filter does not know which packets have realtime constraints, it cannot schedule the CPU appropriately, and some realtime packets may needlessly miss their deadlines. Rather than fixing the filter to make it aware of which packets have realtime constraints, a better solution is to recognize that these resource-management issues are associated with the data, rather than with the particular module that happens to be processing it. Once we recognize this fact, we can look for a more general solution that makes it possible to use unmodified filters, such as the run-length encoder, while retaining the ability to perform proper resource management.
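The C sketch below illustrates this idea: the resource-management state (here, a deadline) travels with the message, and a dispatcher, rather than the unmodified filter, consults it. The msg type, the scheduler hook, and all names are hypothetical.

    #include <stddef.h>
    #include <stdint.h>

    struct msg {
        uint8_t *data;
        size_t   len;
        int      has_deadline;  /* resource-management state travels */
        uint64_t deadline_us;   /* ... with the data, not the filter */
    };

    /* Hypothetical scheduler hook: raise this thread's priority so
     * the message can be processed before its deadline. */
    static void schedule_for_deadline(uint64_t deadline_us)
    {
        (void)deadline_us;      /* elided in this sketch */
    }

    /* Unmodified, context-free filter (body elided in this sketch). */
    static struct msg *rle_encode(struct msg *in)
    {
        return in;
    }

    /* The dispatcher, not the filter, performs resource management. */
    struct msg *dispatch(struct msg *in)
    {
        if (in->has_deadline)
            schedule_for_deadline(in->deadline_us);
        return rle_encode(in);  /* filter runs at the right priority */
    }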

Note that such resource-management problems occur not just for CPU scheduling but also for memory management and, indeed, for any resource in a computer system. In the memory-management realm, some applications require hard guarantees on the availability of memory. For example, paging over the network requires that the networking subsystem guarantee it will not run out of memory while processing a packet on behalf of the pager; otherwise, the pager itself may deadlock when attempting to free memory by paging out over the network. Again, one might be able to solve this problem by modifying each module in the networking subsystem, but a more general solution would certainly be preferable.
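One plausible form of such a general solution is a buffer pool reserved at boot time for paging traffic, so that packet processing on behalf of the pager never depends on the general allocator. The following C sketch is illustrative only; all names are hypothetical.

    #include <stddef.h>

    #define PAGEOUT_BUFS    16
    #define PAGEOUT_BUFSIZE 4096

    static char  pool[PAGEOUT_BUFS][PAGEOUT_BUFSIZE]; /* reserved at boot */
    static void *freelist[PAGEOUT_BUFS];
    static int   nfree;

    void pageout_pool_init(void)
    {
        for (nfree = 0; nfree < PAGEOUT_BUFS; nfree++)
            freelist[nfree] = pool[nfree];
    }

    /* Never blocks on, or falls back to, the general allocator; a
     * NULL return means the caller must wait for a buffer to be
     * returned, not that memory is exhausted system-wide. */
    void *pageout_buf_alloc(void)
    {
        return nfree > 0 ? freelist[--nfree] : NULL;
    }

    void pageout_buf_free(void *buf)
    {
        freelist[nfree++] = buf;
    }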

To summarize: since a module by definition does not concern itself with the context in which it is used, purely modular systems cannot by themselves accommodate applications that need global service guarantees, such as processing a data item within a given deadline or without running out of memory.

