We believe that the best way to enable novel research based on the use of cost-effective high performance computing throughout the university is to put in place a coherent campus-wide system of systems, which we call the Millennium Hierarchical Cluster Architecture. Viewing the Intel equipment in this way, rather than as an ad hoc collection of isolated requests, allows us to address the broader spectrum of needs of this research community and to meet their continued needs as their work advances using the available infrastructure. In the CSD, we have quite a bit of experience building experimental high-performance clusters through the Network of Workstations (NOW) project, and supporting external research efforts on this infrastructure through the Titan NSF research infrastructure grant.
Currently, as part of NOW we have a 105-processor UltraSparc cluster, an older 32-processor SparcStation cluster, a 4x8-processor cluster of SMPs, and a 35-processor Intel PentiumPro cluster with nearly 500 IBM disks. The UltraSparc cluster has a complete parallel programming environment, is used heavily by many research projects, holds the world record for disk-to-disk sorting, broke the RSA 40-bit key challenge, will be the first cluster to appear on the TOP-500 list, and has demonstrated scalability as good as the Cray T3D on the NAS Parallel Benchmarks, with much greater node performance. The NPACI partnership requested that it be provided as a resource to the national community so that computational scientists could gain experience on this emerging class of architectures. We have learned through this hands-on experience that there are clear thresholds of engineering and system administration difficulty with cluster size. A wealth of hardware, software, and system design experience is being translated to the Millennium cluster-of-clusters.
The driving goal for the Millennium design is to enable new investigations in science, engineering, business, and information processing through easy access to high performance computing resources. At the same time, we must recognize the institutional need for autonomy in the research endeavors, the preciousness of time for the caliber of research faculty and students involved, and the relatively high cost of system administration in the university environment. To meet these combined goals, we propose to organize the Intel equipment donation into five progressive levels of performance, sharing, and expected sophistication of the user community. We call these: the desktop, the SMP server, the group NOW, the campus NOW, and Millennium. A smooth transition of computing environment will be provided across the levels, allowing them to be viewed as a coherent system. We have used current Intel products and prices to budget the scale of each of the levels. However, we plan to grow the infrastructure over three years and hope to approach a configuration closer to the 1000-processor Millennium that we have in mind.
2.1 Intel/NT Computational Modeling Desktop
The Intel/NT desktop environment is placed directly in front of the key faculty and graduate student researchers. We have seen through several past equipment donations that a new environment is most successful if the researchers are using it day in and day out. Millennium will involve a switch from Unix workstations for many researchers. In addition to the windows system environment, they will need to move over the computational modeling application used in their discipline and adapt to new mathematical libraries, compilers, debuggers, etc.
2.1 SMP NT Server
Each of the research groups involved in the project will have a PentiumPro SMP. This provides a computing resource with substantially more memory and disk capacity than has generally been available to them, or will be available on the desktop, so they can make a significant step forward in the scale of modeling. We intend to provide, as part of the Millennium software architecture described below, a convenient shared address space parallel programming environment in Fortran, C, C++, and Java for these SMPs, so there is a natural step to parallelize the most time-consuming sections of the application codes. There will also be standard message passing environments. The Millennium numerical libraries will exploit the SMP, so many of the researchers in the disciplines will gain much of the benefit of parallelism without doing parallel programming themselves.
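As an illustration of the shared address space style described above, the following is a minimal sketch in Python (not one of the languages named in this proposal, and not the Millennium environment itself). The function name parallel_sum_of_squares and the default of four workers, chosen to match a 4-way SMP, are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum_of_squares(values, workers=4):
    """Sum x*x over `values`, splitting the index range across threads.

    Every thread reads the same `values` list directly (a shared address
    space); nothing is copied or sent between workers.
    """
    n = len(values)
    # Ceiling division so every element falls into exactly one chunk.
    chunk = max(1, (n + workers - 1) // workers)
    bounds = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]

    def partial(bound):
        lo, hi = bound
        return sum(values[i] * values[i] for i in range(lo, hi))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Combine the per-thread partial sums into the final result.
        return sum(pool.map(partial, bounds))
```

Only the time-consuming loop is restructured; the caller still sees an ordinary function. Note that in CPython the threads illustrate the programming model rather than the speedup, since the interpreter does not run Python bytecode on several processors at once.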
2.2 Group SMP-based NOWs
Several SMP-based NOWs will be constructed deploying a high-performance system-area network (SAN), in addition to the 100 Mb/s switched Ethernet local-area network. (Currently we use Myrinet for the high-speed interconnect, but commercial SAN technology is evolving rapidly, and we may elect to deploy a newer option, such as ServerNet or Memory Channel.) We intend to migrate the NOW system technology to NT during the project. With the current catalog, each group NOW consists of five 4-way SMP nodes. Our experience is that eight nodes, with one spare, are the "sweet spot" of current SAN technology and can be administered essentially as a single machine. Using SMP nodes, these turn-key clusters offer substantial computing power to meet the advancing needs of the research efforts.
The group NOWs will be placed in the departments participating in the project and administered primarily by those departments. This provides autonomy and local control over the resource. However, there will be a common hardware and software configuration to reduce the overall system design and administration effort. There are two usage models for these clusters. Most of them will be used as general purpose resources for large-scale computational modeling. Two will be dedicated to providing a specific function: the Digital Library cluster will support queries from the web to a vast resource database, and the Biology cluster will be focused on solving the phylogeny problem.
We will deploy "multiprotocol" versions of the parallel programming environment and math libraries, which exploit the SMP hardware for sharing data between processors on an SMP and utilize the high-performance interconnect between clusters. The individual departments will purchase the SAN hardware. The Computer Science Division will provide the drivers and libraries for the fast communication layers as a byproduct of its current research, as well as the programming environment (based on the Castle and Titanium projects), math libraries (from the ScaLAPACK project), and distributed data structure libraries (from the Multipol and p-Sather projects).
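The "multiprotocol" idea can be sketched as a per-message choice of communication path. The Python below is a minimal illustration only: the Rank type, the function names, and the transport labels are hypothetical and are not taken from the NOW or Millennium software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rank:
    """One process of a parallel job (hypothetical placement record)."""
    node: int  # index of the SMP node the process runs on
    cpu: int   # processor index within that node


def choose_transport(src: Rank, dst: Rank) -> str:
    """Pick the communication path for a message from src to dst."""
    if src.node == dst.node:
        # Both processors sit on the same SMP: share data through memory.
        return "shared-memory"
    # Different nodes: go over the high-performance interconnect.
    return "san"


def place_ranks(nodes: int, cpus_per_node: int):
    """Fill one node at a time so that neighbouring ranks share an SMP."""
    return [Rank(node=n, cpu=c)
            for n in range(nodes)
            for c in range(cpus_per_node)]
```

For example, with place_ranks(5, 4), ranks 0 through 3 land on the same node and would exchange data through shared memory, while a message from rank 3 to rank 4 crosses nodes and would use the SAN. The application code calls the same communication routine in both cases; only the library underneath changes path.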
2.3 Campus NOW
As their computational models become more mature and more sophisticated, the most aggressive of the research efforts are expected to grow beyond the capabilities of the group NOWs. To meet this need, we will develop one very large campus NOW. Using the current catalog, this consists of 72 4-way SMP nodes with 512 MB or 1 GB of memory per node and one disk per processor. We intend to build it out of Merced-based processors and, since it is to be built in years two and three, it will likely be substantially beyond the current size. This system is powerful enough to make a qualitative difference to the scale of modeling that can be conducted. It represents enough of a step beyond the existing UltraSparc NOW in scale to capture the interest of the Computer Science researchers, and yet is a near enough step that we have high confidence in delivering a working system to the disciplines. The computing environment will be identical to that on the group NOW. The only difference is scale. The single server of the group cluster is replaced by a cluster of servers providing various system functions on behalf of the external users.
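The aggregate size implied by the current-catalog configuration can be worked out directly. The short Python sketch below does the arithmetic; it assumes the smaller memory figure means 512 MB per node and that 1 GB is counted as 1024 MB, and the function name is invented for the example.

```python
NODES = 72          # campus NOW nodes, from the text
CPUS_PER_NODE = 4   # 4-way SMP nodes, from the text


def campus_now_totals(mem_per_node_mb):
    """Aggregate resources for a given per-node memory size in MB."""
    cpus = NODES * CPUS_PER_NODE
    return {
        "processors": cpus,
        "disks": cpus,  # one disk per processor
        "memory_gb": NODES * mem_per_node_mb / 1024,  # assumes 1 GB = 1024 MB
    }
```

Under these assumptions the configuration comes to 288 processors and 288 disks, with 36 GB of aggregate memory at 512 MB per node or 72 GB at 1 GB per node.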
2.4 Cluster of clusters
The overall organization of Millennium is shown in Figure 1, where ovals indicate clusters, boxes SMP servers, and lines lab workstations. Eight general purpose group NOWs support twelve departments and NERSC. Three schools (Information Management, Business, and Chemistry) have large SMP servers. All of these can evolve their most demanding computational models toward the campus NOW. Two dedicated NOWs each provide a single application service. Each of the groups pictured is also provided with NT desktops.
2.5 Software Architecture
Each cluster, including the campus cluster, will provide a layered software architecture to support the computational modeling efforts. The base layer consists of the commercial operating system, compiler, debugger, and libraries. (Our intention is for this layer to be NT by the end of the project. It is unclear whether the required system capabilities and library support will be available at the beginning of the project. We currently have a project underway in the CS division. We also have a massive "Software Warehouse" consisting of Unix utilities for several architectures, including support for Solaris, BSDi, and Linux on the Intel platform.) Augmenting this base layer are drivers to support low overhead communication on the high speed network. Upon this nodal layer, global cluster-wide capabilities will be constructed. The global layer provides support for running, debugging, and analysing the performance of parallel programs running over the cluster. It also provides remote execution, load balancing, and batch processing for sequential and parallel jobs. An additional layer will provide support for parallel programming, including message passing (MPI) libraries, shared-address-based C and Java programming languages, and distributed data structure libraries. Upon this layer are two levels of mathematical library support. The ScaLAPACK work forms the basis of the core parallel linear algebra routines. Higher level mathematical libraries will be supported in conjunction with the NERSC future technologies group, as part of its DOE 2000 work. Together, these layers form the cluster software environment.
To a large extent, researchers will utilize their group cluster or, later in the project, the campus cluster. However, there will naturally be a desire to share resources more effectively across clusters. We intend to put in place a facility that will allow groups to utilize their resources cooperatively; this will use either the Legion/Nexus/Globus work that is currently part of NPACI or our own WebOS facilities.
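The layering of the cluster software environment described in this section can be summarized as an ordered structure. The Python sketch below is only a reading aid: the short layer names are labels chosen for the example, not names used by the project, and the contents are paraphrased from the text.

```python
# Layers listed bottom-up; each builds on everything listed before it.
LAYERS = [
    ("base", ["operating system", "compiler", "debugger", "libraries"]),
    ("nodal drivers", ["low overhead communication on the high speed network"]),
    ("global", ["running, debugging, and performance analysis of parallel programs",
                "remote execution", "load balancing", "batch processing"]),
    ("parallel programming", ["message passing (MPI) libraries",
                              "shared-address C and Java",
                              "distributed data structure libraries"]),
    ("core math", ["parallel linear algebra routines"]),
    ("higher-level math", ["higher level mathematical libraries"]),
]


def layers_below(name):
    """Return the names of the layers that the named layer is built on."""
    names = [layer_name for layer_name, _ in LAYERS]
    return names[:names.index(name)]
```

For instance, layers_below("global") returns the base layer and the nodal drivers, matching the statement that cluster-wide capabilities are constructed upon the nodal layer.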
February 1999