High Performance Computing Research

PIs: Alex Aiken, David Culler, James Demmel, Susan Graham, Paul Hilfinger, Katherine Yelick

Researchers in the Computer Science Division have a long track record of developing algorithms, software, and systems for high performance computing. The two driving issues for this work are usability across a wide range of machines and performance in light of memory and communication hierarchies and irregularities in the application domains. Much of this work is coupled with significant collaborations with applications scientists, and we will highlight the ASCI project below.

Most of the performance improvements we make can be identified as either a reduction in time spent in the communication and memory systems or a reduction in the idle processor time due to work load imbalance. These memory hierarchy problems are particularly challenging on the current generation of high performance machines built as clusters of commodity SMPs, and even more so on a Millennium machine in which multiple such clusters work in concert. In both cases there are several levels of memory hierarchy, including registers, on-chip and off-chip cache, local and remote memory, and disk. Typically, each level of the hierarchical is at least an order of magnitude slower than the level above, so algorithms that are cognizant of the hierarchy perform several times faster than their oblivious counterparts.

The applications under development by the computational scientists on campus present their own set of challenges, including a high communication to computation ratio, dynamic pointer-based data structures, irregular communication patterns, and unpredictable work . Adaptive Mesh Refinement methods, which are used in computational fluid dynamics computations such as those in ASCI, exemplify these issues. An adaptive mesh is a hierarchy of increasingly refined rectangular meshes in 2D or 3D. Most operations involve nearest neighbor computations on the mesh, and most of the work (roughly 80% in the sequential code) is spent on the finest level. In a typical 3D code, each fine mesh patch is roughly 10x10x10, so nearly half of the points are on the surface, and the number of floating point operation between nearest neighbor communication is small (about 10). The patches are not all the same size, and if they are placed on processors to evenly balance load, the neighbors of the surface patches will almost always be on a remote processor. Performance results from existing parallel code indicates that load imbalance and communication costs start to overwhelm the computation on more than 16 processors. In addition, the high communication to computation ratio appears at the instruction level as poor cache utilization. Thus, it is critical to the success of the system software that computer science developments take place on a machine that is large enough to exhibit the memory hierarchy and scheduling issues that will exist on the machines used to solve interesting scientific, engineering and business problems.

We are using a multi-faceted approach to address these problems. At the system level, we have developed lightweight communication mechanisms (Active Messages) and are extending this paradigm to account for shared memory and threads within a single SMP. At the library level, we are developing portable high performance kernels that are automatically tuned for multiple levels of cache (PhiPack), algorithms for sparse (SuperLU) and dense (ScaLAPACK) linear algebra, array and grid libraries for high level languages (Tin), and data structures that adapt to the communication system (Multipol). At the language level, we are collaborating on a "standard" parallel C (based in part on Split-C) and a high performance dialect of C++ (Titanium). Specific to clusters, we are designing lightweight thread packages that are suitable for irregular applications and load balancing techniques for applications with varying degrees of locality concerns and advanced information about work load. Essential to both usability and performance is compiler work (Titanium) that optimizes for caches, uses memory operations in place of communication when possible, takes advantage of weak memory models (e.g., prefetch and cache flush instructions), and is integrated with the libraries through annotations on the library interface regarding optimization opportunities. Assessing the scalability of these techniques requires a machine as large as the campus NOW.

Now we describe the DOE/ASCI (Accelerated Strategic Computing Initiative) project in more detail. This is a proposed Center which DOE would establish as part of its strategic plan to replace nuclear testing with numerical simulations on Teraflop-scale machines. There are 15 faculty from Astronomy, EECS, ME, NE , Mathematics, and Physics collaborating on a large astrophysical simulation. The Astronomy section of this proposal describes some of these simulations in more detail. One of our target architectures is the ``ASCI Red'' machine at Sandia, a 9000+ Pentium cluster, and the campus NOW in Millennium would be an ideal development platform for algorithms before moving to ASCI Red. In particular, AMR and Titanium are central parts of this project, and this proposed Center would create a very large effort in these areas.


February 1999