Colloquium Speaker

Speaker: Sarita Adve
Rice University
Topic: Instruction-Level Parallelism and Shared-Memory Multiprocessors
Date: Thursday, April 1, 1999
Time: 9:30 AM
Place: Gould-Simpson, Room 701


Refreshments will be served in the 7th-floor lobby of Gould-Simpson at 9:15 AM


ABSTRACT


Current shared-memory multiprocessors are built from commodity processors that exploit instruction-level parallelism (ILP) through increasingly complex architectural features. Most evaluation work on shared-memory systems, however, assumes previous-generation processors that are much simpler than current ILP processors. The RSIM project is the first to study how ILP processor features affect shared-memory system performance, programmability, and evaluation methodology. This talk will discuss some of our key results on shared-memory performance and simulation methodology.

We find that although ILP features substantially reduce CPU time in shared-memory multiprocessors, they are less effective in reducing memory stall time. Consequently, memory stall time is a larger bottleneck in ILP multiprocessors than in previous-generation systems, and ILP systems see lower speedups. We identify the key reason as insufficient read miss parallelism exposed by the application to the hardware. We discuss a compiler transformation that increases read miss parallelism by clustering read misses without compromising locality; a simplified sketch of the idea follows below. In contrast, previous locality-enhancing compiler transformations generally tend to move read misses further apart.
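The sketch below is illustrative only and is not code from the talk; the array names, sizes, stride, and unroll factor are assumptions. It shows the intuition behind read miss clustering: unrolling a loop so that several independent read misses sit close together in the processor's instruction window and can be outstanding at the same time, instead of being separated by intervening computation.

/* Hypothetical sketch of read miss clustering (names and parameters
 * are illustrative assumptions, not from the talk). */

#define N      1024
#define STRIDE 16              /* assume one accessed element per cache line */

static double a[N * STRIDE], b[N * STRIDE];
static double sum = 0.0;

/* Original loop: each iteration's read misses are separated by compute,
 * so the misses are largely serialized. */
void unclustered(void) {
    for (int i = 0; i < N * STRIDE; i += STRIDE)
        sum += a[i] * b[i];
}

/* Clustered version: unrolling places several independent read misses
 * back to back, letting an out-of-order ILP processor overlap their
 * latencies while preserving the original access pattern (locality). */
void clustered(void) {
    for (int i = 0; i < N * STRIDE; i += 4 * STRIDE) {
        double a0 = a[i],              b0 = b[i];
        double a1 = a[i + STRIDE],     b1 = b[i + STRIDE];
        double a2 = a[i + 2 * STRIDE], b2 = b[i + 2 * STRIDE];
        double a3 = a[i + 3 * STRIDE], b3 = b[i + 3 * STRIDE];
        sum += a0 * b0 + a1 * b1 + a2 * b2 + a3 * b3;
    }
}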

Our analysis also reveals that commonly used simulators based on simple processor models can exhibit large errors when used to model current ILP shared-memory systems (over 100% error in some cases). Unfortunately, simulators that model ILP processors in detail are about ten times slower. We have developed a new simulation technique that alleviates this tradeoff between speed and accuracy. Our new simulator, DirectRSIM, is almost as accurate as detailed ILP simulators, but has a slowdown of only 2.7X relative to the simple-processor simulators. The high errors of simple-processor models and the competitive speed of our new technique suggest that it may be time to abandon the use of simple-processor models when simulating shared-memory systems.