This paper describes PLTO (``Pluto''), a link-time optimizer for Intel Pentium processors. PLTO automatically improves the performance of parallel scientific programs that use the MPI library. The input to PLTO is a binary program consisting of object code for the application and for the MPI library. PLTO produces a functionally equivalent binary program that executes faster, even when the object code for the application and for the library has already been highly optimized. The main techniques that PLTO employs are specializing the code in MPI library routines to account for exactly how they are used by an application, inlining the code to avoid function-call overhead and to eliminate redundant loads and stores, and laying out the code to improve instruction cache performance. These techniques are general: they can be used for any application and any communication library.