For immediate release, August 22, 2003, Lexington, Kentucky: Researchers at the University of Kentucky have constructed and demonstrated an innovative, scalable parallel supercomputer that achieves application performance of more than 1 billion floating point operations per second (GFLOPS) for every $100 spent on building the machine. The approach used to design and build this machine makes it cost-effective for solving a wide range of problems, from drug design using computational chemistry to design of quieter printers using computational fluid dynamics (CFD). Thus, this breakthrough is not only a milestone but will also enable many more scientists and engineers to use computational models.
A decade ago, supercomputers cost about $1,000,000 per GFLOPS of performance. By using standard PC parts, "Beowulf" cluster supercomputers dramatically reduce the cost, but as processors and other components have become faster and cheaper, the network needed to coordinate them has become relatively expensive. The University of Kentucky researchers made their first breakthrough in reducing network cost in May 2000, when KLAT2, Kentucky Linux Athlon Testbed 2 (http://aggregate.org/KLAT2/), used standard 100Mb/s Fast Ethernet hardware in the world's first machine-designed asymmetric cluster network -- and achieved $640 per GFLOPS, breaking the $1,000 per GFLOPS barrier. Their newest machine, KASY0, Kentucky Asymmetric Zero (http://aggregate.org/KASY0/), uses a more advanced type of asymmetric network design to break the $100 per GFLOPS barrier.
A well-known reference for supercomputer performance is http://top500.org/, which lists the 500 supercomputers that obtain the highest GFLOPS speed executing a Linpack benchmark program. Performance on that program depends partly on the theoretical peak GFLOPS of the processors, but also on the parallel implementation and the efficiency of the network that allows the processors to work together. In the current (June 2003) list, most systems use expensive, specialized network hardware. The machines explicitly listed as using standard 100Mb/s Fast Ethernet achieve an average of less than 8.5% of peak. The average for the systems listed as using Gigabit Ethernet is somewhat better, at about 30% of peak. In contrast, KASY0's 100Mb/s Fast Ethernet network allows it to achieve 187.3 GFLOPS, over 35% of peak, using a double-precision version of the benchmark (HPL). Using a single-precision version, the $39,454.31 KASY0 obtains over 471.5 GFLOPS, more than 44% of its theoretical peak and less than $84 per GFLOPS.
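For readers who want to check the price/performance arithmetic, the short Python sketch below recomputes dollars per GFLOPS from the cost and benchmark figures quoted above. The function and constant names are illustrative only and are not part of any KASY0 software. The "peak bound" simply inverts the quoted "over 35%" and "more than 44%" statements, so it gives an upper limit on the theoretical peak, not a measured value.

```python
# Figures quoted in this release; all names here are illustrative assumptions.
KASY0_COST_USD = 39_454.31
HPL_DOUBLE_GFLOPS = 187.3   # double-precision HPL result
HPL_SINGLE_GFLOPS = 471.5   # single-precision result


def dollars_per_gflops(cost_usd: float, gflops: float) -> float:
    """Price/performance: system cost divided by sustained GFLOPS."""
    if gflops <= 0:
        raise ValueError("gflops must be positive")
    return cost_usd / gflops


def peak_upper_bound(measured_gflops: float, min_fraction_of_peak: float) -> float:
    """If measured speed exceeds the given fraction of peak,
    then peak must be below measured / fraction."""
    if not 0 < min_fraction_of_peak <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return measured_gflops / min_fraction_of_peak


if __name__ == "__main__":
    single = dollars_per_gflops(KASY0_COST_USD, HPL_SINGLE_GFLOPS)
    double = dollars_per_gflops(KASY0_COST_USD, HPL_DOUBLE_GFLOPS)
    print(f"single precision: ${single:.2f} per GFLOPS")
    print(f"double precision: ${double:.2f} per GFLOPS")
    print(f"single-precision peak is below "
          f"{peak_upper_bound(HPL_SINGLE_GFLOPS, 0.44):.1f} GFLOPS")
```

The single-precision figure comes out under $84 per GFLOPS, consistent with the claim above. The double-precision figure is higher because the same system cost is divided by a smaller sustained rate.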
The remarkable thing about KASY0's price/performance is that, while network hardware is often the dominant cost for a system of its size (128 plus 4 spare nodes), less than 11% of the system cost went for the network hardware. The AMD Athlon XP 2600+ processors were more than 35% of the total system cost; memory was 21%. Even more significantly, the network design technology that made this possible can be applied with similar benefit to cluster supercomputers with thousands of nodes. KLAT2's network was the world's first Flat Neighborhood Network; the enhanced version used for KASY0 is the world's first Sparse Flat Neighborhood Network (SFNN). KASY0 also is the first supercomputer to have its physical node and switch placement optimized by a computer program. FNN design technology and tools have been freely available and used by various other groups; so too will the new SFNN technology be freely available.
KASY0 is not a toy or a "hack" -- it is a serious demonstration of a fundamental new advance in network design. The only other supercomputer we have seen claim close to the price/performance measured for KASY0 is a $50,000+ system built by the National Center for Supercomputing Applications (NCSA) using 70 PlayStation2 units. Not only does KASY0 have a vastly superior network and significantly higher peak floating point performance per node, but KASY0's lower price yields many more nodes and real application performance, not just high peak numbers.
For example, KASY0 also has set a new world record for rendering a complex image using the Persistence of Vision Raytracer (POV-Ray). Executing pvmpovray 3.5 on KASY0 to render the standard benchmark.pov scene yielded a time of 72 seconds. According to this site, the previous record was 107 seconds set on August 1, 2003 by a cluster costing $79,000.
The primary architect of KASY0 is Tim Mattox, a research assistant who has been developing the Sparse Flat Neighborhood Network concept for his Ph.D. thesis. As an educational experience available to anyone, the physical construction of KASY0 was done entirely by volunteers at the University of Kentucky.
From the creation of the first Linux PC cluster in February 1994 to the construction of KASY0, Hank Dietz and his students have continued to improve cluster performance by making compilers, hardware architecture, and operating system work together more efficiently. At the University of Kentucky, as Professor of Electrical and Computer Engineering and James F. Hardymon Chair in Networking, Dietz's goal is to develop and freely disseminate the new technologies that will allow scientists and engineers to solve their most important computational problems.
As of September 16, 2003, we've done some additional tuning of the software that allows KASY0 to get 482.6 GFLOPS instead of the 471.5 reported above. This brings performance to over 45% of theoretical peak and less than $82 per GFLOPS. Our newly tuned SGEMM is being incorporated into Automatically Tuned Linear Algebra Software (ATLAS).
If you have any questions or comments, contact:
Professor Hank Dietz, James F. Hardymon Chair in Networking
University of Kentucky
College of Engineering
Electrical and Computer Engineering Department
453 Anderson Tower
(Office 469 Anderson Tower, Labs 672, 695, and 577 Anderson Tower)
Lexington, KY 40506-0046
Office Phone: (859) 257 4701
Fax: (859) 257 3092
Email: firstname.lastname@example.org
Home URL: http://aggregate.org/hankd/