Filaments-Related Publications

This page provides links to PostScript copies of all papers listed.
Dawson R. Engler, David K. Lowenthal, and Gregory R. Andrews. Shared Filaments: Efficient Fine-Grain Parallelism on Shared-Memory Multiprocessors. TR 93-13a, University of Arizona, April 1993 (187K bytes, 84K compressed)

Abstract

It has long been thought that coarse-grain parallelism is much more efficient than fine-grain parallelism due to the overhead of process (thread) creation, context switching, and synchronization. On the other hand, there are several advantages to fine-grain parallelism: architecture independence, ease of programming, ease of use as a target for code generation, and load-balancing potential. This paper describes a portable threads package, Filaments, that supports efficient execution of fine-grain parallel programs on shared-memory multiprocessors. Filaments supports three kinds of threads---run-to-completion, barrier (iterative), and fork/join---which appear to be sufficient for scientific computations. Filaments employs a unique combination of techniques to achieve efficiency: stateless threads, very small thread descriptors, optimized barrier synchronization, scheduling that enhances data locality, and automatic pruning of fork/join threads. The performance gains are such that, for an application such as Jacobi iteration, the execution time of a fine-grain program with a worst-case granularity of one thread per point can be within 10% of that of a coarse-grain program with only one task per processor. Execution times for problems with more work per thread are usually indistinguishable from those of coarse-grain programs, and they can be faster when the amount of work per thread varies.
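The granularity contrast the abstract describes can be sketched as follows. This is an illustrative Python sketch, not the Filaments API; the names jacobi_point, fine_sweep, and coarse_sweep are invented for the example. Both sweeps compute the same Jacobi update; they differ only in how the work is divided.

```python
def jacobi_point(grid, i, j):
    """The work of a single fine-grain thread: average the four neighbors."""
    return (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]) / 4.0

def fine_sweep(grid):
    """Fine grain: one logical thread per interior point (worst-case granularity)."""
    n = len(grid)
    return [[jacobi_point(grid, i, j) if 0 < i < n-1 and 0 < j < n-1 else grid[i][j]
             for j in range(n)] for i in range(n)]

def coarse_sweep(grid, nprocs=2):
    """Coarse grain: one task per processor, each owning a strip of rows."""
    n = len(grid)
    new = [row[:] for row in grid]
    interior = list(range(1, n - 1))
    chunk = -(-len(interior) // nprocs)           # rows per processor, rounded up
    for p in range(nprocs):                        # one coarse task per "processor"
        for i in interior[p * chunk:(p + 1) * chunk]:
            for j in range(1, n - 1):
                new[i][j] = jacobi_point(grid, i, j)
    return new
```

The paper's point is that a good runtime can make the fine-grain form nearly as fast as the coarse-grain one while keeping the simpler per-point expression of the algorithm.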


Vincent W. Freeh. A Comparison of Implicit and Explicit Parallel Programming. TR 93-30a, University of Arizona, June 1994 (271K bytes, 108K compressed)

Abstract

The impact of the parallel programming model on scientific computing is examined. A comparison is made between Sisal, a functional language with implicit parallelism, and SR, an imperative language with explicit parallelism. Both languages are modern, high-level, concurrent programming languages. These two concurrent languages are also compared to programs written in C, using library calls for parallelism. The languages are evaluated by writing programs for six different scientific applications in each language. These programs are compared for programmability and performance on a shared memory multiprocessor.


Vincent W. Freeh, David K. Lowenthal, and Gregory R. Andrews. Distributed Filaments: Efficient Fine-Grain Parallelism on a Cluster of Workstations. First Symposium on Operating Systems Design and Implementation, p. 201-212, Monterey, CA, November 14-17, 1994 (217K bytes, 89K compressed)

Abstract

A fine-grain parallel program is one in which processes are typically small, ranging from a few to a few hundred instructions. Fine-grain parallelism arises naturally in many situations, such as iterative grid computations, recursive fork/join programs, the bodies of parallel FOR loops, and the implicit parallelism in functional or dataflow languages. It is useful both to describe massively parallel computations and as a target for code generation by compilers. However, fine-grain parallelism has long been thought to be inefficient due to the overheads of process creation, context switching, and synchronization. This paper describes a software kernel, Distributed Filaments (DF), that implements fine-grain parallelism both portably and efficiently on a workstation cluster. DF runs on existing, off-the-shelf hardware and software. It has a simple interface, so it is easy to use. DF achieves efficiency by using stateless threads on each node, overlapping communication and computation, employing a new low-overhead, reliable datagram communication protocol, and automatically balancing the work generated by fork/join computations. The net result is excellent speedup on a variety of applications.
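The fork/join pruning idea mentioned above (and in the Shared Filaments abstract) can be sketched with ordinary Python threads. This is a hypothetical illustration, not DF's interface: psum and PRUNE_CUTOFF are invented names, and the cutoff value is an assumption. Below the cutoff, a subproblem runs inline instead of forking, which is what keeps thread-creation overhead bounded.

```python
from concurrent.futures import ThreadPoolExecutor

PRUNE_CUTOFF = 1024  # assumed threshold: below this size, do not fork

def psum(data, pool):
    """Fork/join sum with pruning: small subproblems are computed inline,
    so only large subproblems pay the cost of creating a parallel task."""
    if len(data) <= PRUNE_CUTOFF:                # pruned: no fork, just compute
        return sum(data)
    mid = len(data) // 2
    left = pool.submit(psum, data[:mid], pool)   # fork the left half
    right = psum(data[mid:], pool)               # compute the right half in place
    return left.result() + right                 # join

with ThreadPoolExecutor(max_workers=32) as pool:
    total = psum(list(range(10000)), pool)
```

A real implementation would also balance the forked work across nodes, as the abstract describes; the sketch only shows the pruning decision.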


David K. Lowenthal and Gregory R. Andrews. Adaptive Data Placement for Distributed-Memory Machines. To appear, 10th International Parallel Processing Symposium, Honolulu, Hawaii, April 15-19, 1996 (192K bytes, 77K compressed)

Abstract

Programming distributed-memory machines requires careful placement of data on the nodes. This is because achieving efficiency requires balancing the computational load among the nodes and minimizing excess data movement between the nodes. Most current approaches to data placement require the programmer or compiler to place data initially and then possibly to move it explicitly during a computation. This paper describes a new, adaptive approach to data placement. It is implemented in the Adapt system, which takes an initial data placement, efficiently monitors how well it performs, and changes the placement whenever the monitoring indicates that a different placement would perform better. Using Adapt can simplify the programming of parallel systems and simplify compilers for parallel languages such as HPF. In particular, Adapt frees the programmer from having to specify data placements, and it frees the compiler from having to do often complex analysis to determine a good placement. Moreover, Adapt supports a new "variable block" placement, which is especially useful for applications with nearest-neighbor communication but an imbalanced workload. For applications in which the best data placement varies dynamically, using Adapt can lead to much better performance than using any statically determined data placement. We present the performance of Adapt on three scientific applications that require different data placements.
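The "variable block" placement can be illustrated with a small sketch. This is not Adapt's algorithm, just a plausible greedy version of the idea: given a per-row cost estimate, choose contiguous blocks whose summed cost is roughly equal, so a node that owns expensive rows gets fewer of them. The function name and its greedy policy are assumptions for the example.

```python
def variable_blocks(costs, nnodes):
    """Split rows into contiguous, variable-size blocks so each node's
    summed cost approximates total/nnodes.  Returns (start, end) pairs."""
    total = sum(costs)
    target = total / nnodes
    bounds, acc, start = [], 0.0, 0
    for i, c in enumerate(costs):
        acc += c
        # close a block once the per-node target is met, but keep at least
        # one row available for every remaining node
        if (acc >= target and len(bounds) < nnodes - 1
                and len(costs) - i - 1 >= nnodes - len(bounds) - 1):
            bounds.append((start, i + 1))
            start, acc = i + 1, 0.0
    bounds.append((start, len(costs)))
    return bounds

# An imbalanced workload: cheap rows first, expensive rows last.
# variable_blocks([1, 1, 1, 1, 4, 4, 4, 4], 2) gives [(0, 6), (6, 8)]:
# node 0 takes six cheap rows, node 1 takes two expensive ones.
```

With a uniform workload this degenerates to equal-size blocks, which matches the abstract's claim that the placement matters most when the workload is imbalanced.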


Vincent W. Freeh and Gregory R. Andrews. fsc: A Sisal Compiler for Both Distributed- and Shared-Memory Machines. To appear, Conference on High-Performance Functional Computing, Denver, CO, April 9-11, 1995 (173K bytes, 68K compressed)

Abstract

This paper describes a prototype Sisal compiler that supports distributed- as well as shared-memory machines. The compiler, fsc, modifies the code-generation phase of the optimizing Sisal compiler, osc, to use the Filaments library as a run-time system. Filaments efficiently supports fine-grain parallelism and a shared-memory programming model. Using fine-grain threads makes it possible to implement recursive as well as loop parallelism; it also facilitates dynamic load balancing. Using a distributed implementation of shared memory (a DSM) simplifies the compiler by obviating the need for explicit message passing.


David K. Lowenthal, Vincent W. Freeh, and Gregory R. Andrews. Using Fine-Grain Threads and Run-Time Decision Making in Parallel Computing. To appear, Journal of Parallel and Distributed Computing, Special Issue on Multithreading for Multiprocessors (244K bytes, 97K compressed)

Abstract

Programming distributed-memory multiprocessors and networks of workstations requires deciding what can execute concurrently, how processes communicate, and where data is placed. These decisions can be made statically by a programmer or compiler, or they can be made dynamically at run time. Using run-time decisions leads to a simpler interface---because decisions are implicit---and it can lead to better decisions---because more information is available. This paper examines the costs, benefits, and details of making decisions at run time. The starting point is explicit fine-grain parallelism with any number (even thousands) of threads. Five specific techniques are considered: (1) implicitly coarsening the granularity of parallelism, (2) using implicit communication implemented by a distributed shared memory, (3) overlapping computation and communication, (4) adaptively moving threads and data between nodes to minimize communication and balance load, and (5) dynamically remapping data to pages to avoid false sharing. Details are given on the performance of each of these techniques as well as their overall performance on several scientific applications.


David K. Lowenthal, Vincent W. Freeh, and Gregory R. Andrews. Efficient Fine-Grain Parallelism on Shared-Memory Machines. TR 96-1, University of Arizona, January 1996 (199K bytes, 80K compressed)

Abstract

A coarse-grain parallel program typically has one thread (task) per processor, whereas a fine-grain program has one thread for each independent unit of work. Although there are several advantages to fine-grain parallelism, conventional wisdom is that coarse-grain parallelism is more efficient. This paper illustrates the advantages of fine-grain parallelism and presents an efficient implementation for shared-memory machines. The approach has been implemented in a portable software package called Filaments, which employs a unique combination of techniques to achieve efficiency. The performance of the fine-grain programs discussed in this paper is always within 13% of a hand-coded coarse-grain program and is usually within 5%.