As soon as KLAT2's hardware was stable, we wanted to show that it truly is a substantial supercomputer. Following the guidelines set forth on http://www.top500.org/, we obtained the software for the "Linpack Benchmark" and optimized the BLAS to use 3DNow! SIMD Within A Register (SWAR). Also in accordance with the guidelines, we will make our optimized BLAS available for others to use; rather than distributing a binary object, we plan to distribute a set of public domain source code patches.
KLAT2's complete results are:
Rmax=64.459 GFLOPS with Nmax=40,960 (N=40,960, Blocking factor 64, 8x8 grid; LU time 709.98s, sol time 0.78s; residual is 0.000000)
Approximate N1/2=13,824, at which performance is 32.296 GFLOPS (Blocking factor 64, 8x8 grid; LU time 54.30s, sol time 0.24s; residual is 0.000000)
Theoretical peak is 179.2 GFLOPS (we'll never see that!)
Machine configuration: 64 Athlon 700MHz with 128MB PC100 CAS2 SDRAM on FIC SD11 motherboards (boot floppy, no hard disk); Flat Neighborhood network implemented using 256 Smartlink 100Mb/s NICs and 9 32-way switches; RedHat Linux 6.0 with 2.2.14 kernel; LAM MPI 6.3.3b1 (with Flat Neighborhood patch); Egcs 2.91.66, G77 0.5.24-19981002, and our 3DNow! SWAR support (http://aggregate.org/SWAR/); ScaLAPACK 1.6; BLACS 1.1; ATLAS 3.0beta (with our 3DNow! code inserted by hand)
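As a rough cross-check of the figures above, the following Python sketch recomputes the GFLOPS rates from the reported problem sizes and times. It rests on two assumptions that are not stated in the text: that the rate uses the conventional Linpack operation count of 2/3 N^3 + 2 N^2, and that the elapsed time is the sum of the LU and solve times. The peak figure further assumes 4 single-precision operations per clock per Athlon using 3DNow!, which is consistent with the stated 179.2 GFLOPS. The function names are ours, chosen only for illustration.

```python
def linpack_gflops(n, lu_seconds, solve_seconds):
    """Estimate GFLOPS for an N x N dense solve.

    Assumes the conventional Linpack operation count (2/3 N^3 + 2 N^2)
    and that elapsed time is LU time plus solve time.
    """
    operations = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return operations / (lu_seconds + solve_seconds) / 1e9


def peak_gflops(processors, clock_ghz, flops_per_cycle):
    """Theoretical peak: processors x clock rate x operations per cycle."""
    return processors * clock_ghz * flops_per_cycle


# Figures taken from the results listed above.
rmax_estimate = linpack_gflops(40960, 709.98, 0.78)
half_estimate = linpack_gflops(13824, 54.30, 0.24)

# Assumption: 4 single-precision operations per cycle with 3DNow!.
peak = peak_gflops(64, 0.7, 4)

print(f"Estimated rate at N=40,960: {rmax_estimate:.3f} GFLOPS")
print(f"Estimated rate at N=13,824: {half_estimate:.3f} GFLOPS")
print(f"Theoretical peak:           {peak:.1f} GFLOPS")
print(f"Fraction of peak achieved:  {rmax_estimate / peak:.1%}")
```

Under these assumptions the estimates land within a few thousandths of a GFLOPS of the reported 64.459 and 32.296; the small differences presumably come from rounding in the reported times.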
Relative to the current (Fall 1999) list, this performance is slightly better than that of the machine ranked 150th. Unfortunately, after submitting our performance numbers, we were informed that, although the WWW site does not specify this, only "64-bit precision" performance values are now accepted for Jack Dongarra's Top500 list. Thus, KLAT2 will not be listed. KLAT2's IA32 legacy 80-bit floating point performance (80-bit values can be rounded to 64-bit double-precision when stored in memory) is also fairly spectacular, but the machine was not designed to optimize that performance, so we would need more SDRAM and slightly faster Athlons to rank in the top 500 that way. We would have built KLAT2 with that configuration had we known that 32-bit 3DNow! performance would be rejected from the list, but it is too late now.
In any case, the configuration we did build is at a slightly better price/performance point than it would have been if we had optimized the design for IA32 legacy 80-bit performance. Whether on that list or not, KLAT2 achieves its performance goal, clearly demonstrating that smart network design and efficient use of features like 3DNow! really do give awesome performance at a very low cost. KLAT2 is the first supercomputer to break the $1K/GFLOPS barrier; it actually costs less than $650 per GFLOPS delivered on 32-bit ScaLAPACK.
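To put the price/performance claim in absolute terms, the short Python sketch below converts a dollars-per-GFLOPS rate into the total system cost it implies at the measured Rmax. The actual cost of KLAT2 is not given in this section, so the result is only an upper bound derived from the "less than $650 per GFLOPS" statement; the function name is hypothetical.

```python
def implied_cost_bound(dollars_per_gflops, gflops):
    """Total cost implied by a given price/performance rate."""
    return dollars_per_gflops * gflops


RMAX_GFLOPS = 64.459  # measured 32-bit ScaLAPACK result reported above

# Bound implied by "less than $650 per GFLOPS".
klat2_bound = implied_cost_bound(650.0, RMAX_GFLOPS)

# Cost at which the same performance would sit exactly on the $1K/GFLOPS barrier.
barrier_cost = implied_cost_bound(1000.0, RMAX_GFLOPS)

print(f"KLAT2 cost is below about ${klat2_bound:,.2f}")
print(f"$1K/GFLOPS barrier at this Rmax: ${barrier_cost:,.2f}")
```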