Frequently Asked Questions about KASY0

As if you didn't know, this document is always under construction....

What's in a name?

Performance and benchmarking

Technical configuration information

Other things that bother people

Don't see your question above? You can send email to tmattox@NOSPAMieee.org and/or hankd@engr.uky.edu.
Maybe we'll even answer. ;-) (remove the NOSPAM for a valid e-mail address)


How do you pronounce KASY0?

We pronounce KASY0 like "Kay-See Zero."

It is not "Kay-See-Oh" and any computer person should be ashamed of even thinking that "0" is equivalent to "O". ;-)

What does the name mean?

KASY0 stands for Kentucky Asymmetric Zero. There are plans for a sequence of clusters with progressively more asymmetric networks, as our design technology becomes sufficiently advanced.

In many ways, KASY0 is the logical next step beyond KLAT2 (Kentucky Linux Athlon Testbed 2), which showcased the world's first deliberately asymmetric cluster network. KASY0 could have been called KLAT3, but there was never a KLAT1 because KLAT2 was an obscure reference to Klaatu, the fellow from outer space in the classic 1951 science fiction movie The Day The Earth Stood Still.

Is KASY0 really a "Beowulf"?

Yes. Every hardware component is an unmodified commodity subsystem, available from multiple vendors.

What performance does KASY0 really get?

It really gets over 471 GFLOPS on a 32-bit version of HPL. Actually, as of September 16, 2003, we've done some additional tuning of the software that allows KASY0 to get 482.6 GFLOPS; our newly tuned SGEMM is being incorporated into Automatically Tuned Linear Algebra Software (ATLAS). Using an "untuned" 64/80-bit version, KASY0 gets a very respectable 187.3 GFLOPS. These aren't theoretical numbers, they are the real thing. The theoretical we-will-never-see-that numbers are 531 GFLOPS and 1.06 TFLOPS, respectively, for 64/80-bit and 32-bit floating point.
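In case you are wondering where the theoretical numbers come from, the arithmetic is simple: an Athlon can retire two 64/80-bit floating-point results per clock cycle (or four 32-bit results per clock using 3DNow!), so 128 nodes x a roughly 2.08 GHz core clock x 2 FLOPs/clock works out to about 531 GFLOPS, and the same with 4 FLOPs/clock works out to about 1.06 TFLOPS.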

Yes, we know HPL is only one application and not a very general one at that. We have other stuff running as well... but most of what we do is computer system design. Thus, our primary applications tend to be things like the Sparse Flat Neighborhood Network design program, which nobody else yet has. The result is that the performance numbers that are most important to us are meaningless to anybody else.

By the way, what is a GFLOPS?

Good question. Everybody agrees that a GFLOPS (pronounced "Gig-Ah-Flops") is a billion (1,000,000,000) floating-point operations per second. It's less clear why we don't write it as "BFlOpS," but hey, it's not my fault. The "S" is capitalized for seconds; plural of FLOP is done with lowercase "s" as FLOPs. However one writes it, there are two major ambiguities about the definition of GFLOPS:

What constitutes a floating point operation (a FLOP)?

Obviously, any operation on a floating-point value, right? Well, yes and no. Computing the sin(x) is really done by computing a series expansion, so does sin(x) count as one FLOP or as however many addition/subtraction/multiplication/division operations are in the series expansion used? The opposite effect occurs with things like the "Cray FLOP" measure: an absolute value operation on an old Cray was implemented by a three-instruction sequence, so it was 3 FLOPs; however, everybody knows all you have to do is zero the sign bit, which takes only a single (integer) bitwise-AND instruction -- no FLOPs at all. How you count can make a huge difference. If your code only does addition and multiplication, there is general agreement on how you count those... but even a subtract causes ambiguity about one subtraction versus an addition of the negative.
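For a concrete example of that zero-FLOP absolute value, here is a minimal C sketch (just an illustration, not from any benchmark) that clears the sign bit of an IEEE 754 single with one integer AND:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Toy illustration of the "zero-FLOP" absolute value: clear the sign
   bit of an IEEE 754 single with one integer AND. */
static float and_fabsf(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bits */
    bits &= 0x7FFFFFFFu;              /* zero the sign bit            */
    memcpy(&x, &bits, sizeof x);
    return x;
}

int main(void)
{
    printf("%g\n", and_fabsf(-3.25f));   /* prints 3.25 */
    return 0;
}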

The Top500 list essentially defines FLOPs by a formula that counts additions and multiplications assuming a particular algorithm. In the past it allowed other algorithms (see Jack Dongarra's January 18, 2001 report and note the NEC and Cray machines using Strassen's Algorithm), yet generously counted the FLOPs as though Gaussian Elimination were used. For our Linpack (tuned HPL) performance numbers, we use the standard Gaussian Elimination algorithm and quote the FLOPs counted by the Top500's formula.
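To make the counting convention concrete, here is a tiny sketch of how a Linpack-style GFLOPS figure is computed from a formulaic operation count (dominated by 2/3*n^3, plus a lower-order n^2 term) rather than from whatever operations the code actually executed. The problem size and time below are made up, not KASY0's:

#include <stdio.h>

/* Minimal sketch of the Linpack/HPL reporting convention: the FLOP
   count comes from a formula in n, not from counting instructions. */
int main(void)
{
    double n = 100000.0;        /* hypothetical HPL problem size */
    double seconds = 14000.0;   /* hypothetical solve time       */
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    printf("%.1f GFLOPS\n", flops / seconds / 1e9);
    return 0;
}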

Floating-point representations are inherently imprecise; how accurate do the operation results have to be?

Obviously, accurate enough. ;-) Unfortunately, it is very difficult to determine how much accuracy remains after a non-trivial computation is performed using a specific precision, yet precision (number of bits used to store a value) is all that one can directly control. An excellent overview is given in What Every Computer Scientist Should Know About Floating-Point Arithmetic; it isn't exactly light reading, but at least it's lighter than the IEEE 754/854 standards. The standards provide for different bases (e.g., 2, 10, 16), rounding modes, infinities, NaN (Not-a-Number), denormalized arithmetic, etc. The result is that fully compliant implementations of floating point can have a very wide range of accuracy... and there also are many "slightly" non-compliant versions that omit some of the more complex features (which have very little impact, if any, on accuracy). Grossly inferior accuracy, such as the old Crays yielded, is essentially gone from modern machines except for explicit options to perform low-precision versions of inverse, square root, or inverse square root.

Although the Top500 list has a history of accepting whatever floating point was native to the machine (again, see Jack Dongarra's January 18, 2001 report), the latest Top500 FAQ includes an attempt to specify "64 bit floating point arithmetic" -- but, as discussed above, that isn't a well-defined thing. Another interesting point is that 64-bit isn't always 64-bit: because PCs have complicated operations (e.g., sin(x)) implemented as single instructions, x87 floating point registers actually hold 80-bit results that get converted to 64-bits (or even 32-bits) when stored into memory. Thus, PCs actually get significantly higher accuracy than machines with true 64-bit floating-point... sometimes, even 32-bit stored results are more accurate than 64-bit values computed on other machines! The analysis is even more complex in that PC processors using 3DNow!, SSE, and SSE2 do not promise use of an 80-bit internal representation.
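If you want to see the extra x87 precision for yourself, here is a small C illustration. It assumes a compiler and platform (such as GCC on x86 Linux) where long double maps to the 80-bit x87 extended format, so the exact output is platform-dependent:

#include <stdio.h>

/* Small illustration of the extra x87 precision: where long double is
   the 80-bit extended format, it keeps a tiny addend that a 64-bit
   double loses. */
int main(void)
{
    double d = 1.0;
    long double e = 1.0L;

    d += 1e-17;     /* below double's ~16 decimal digits: lost    */
    e += 1e-17L;    /* within extended's ~19 decimal digits: kept */

    printf("double:      %.20f\n", d);
    printf("long double: %.20Lf\n", e);
    return 0;
}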

In summary, few real-world phenomena can be directly measured to accuracies much greater than the 24-bit mantissa of a typical 32-bit IEEE floating-point number. (Ever see a 24-bit linear Analog-to-Digital converter?) Thus, single precision (roughly 32-bit) values are useful for most carefully-coded floating-point algorithms. For example, the Computational Fluid Dynamics (CFD) code that got us a Gordon Bell award in 2000 works beautifully with 32-bit 3DNow! arithmetic. Double precision allows one to be slightly sloppier about accuracy analysis and also provides a significantly wider dynamic range (more exponent bits). Half precision 16-bit values are now commonly used in DSP applications. To us, all these are valid "FLOP precisions" -- but you should specify which you're counting, and we do.
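Here is a quick illustration of what "a 24-bit mantissa" means in practice: 2^24 is where an IEEE 754 single stops being able to represent consecutive integers exactly, while a double still can.

#include <stdio.h>

/* Why "roughly 32-bit" means about 24 bits of mantissa. */
int main(void)
{
    float  f = 16777216.0f;            /* 2^24 */
    double d = 16777216.0;

    printf("float:  %.1f\n", f + 1.0f);   /* still 16777216.0 */
    printf("double: %.1f\n", d + 1.0);    /* 16777217.0       */
    return 0;
}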

Oh yes... you also need to specify what program code you're running because some codes do lots of useful stuff that isn't FLOPs, but the above discussion is already rather long.... ;-)

What machines have comparable price/performance?

At the time KASY0 was built, there were two other contenders. The one we knew about before issuing our press release:

this $50,000+ system built by the National Center for Supercomputing Applications (NCSA) using 70 PlayStation2 units. Even by peak 32-bit numbers, it would be about $110/GFLOPS.

The other system isn't really close, but since some people (on Slashdot, with limited arithmetic skills ;-) claimed it was better, here it is:

McKenzie is a 512-processor (256 dual Xeon 2.4 GHz nodes) cluster with an "innovative cubic network topology" combining gigabit and 100Mb/s Ethernets. It is ranked 38 (they said 39?) in the June 2003 Top500 supercomputers list. The quoted price/performance (1.2 TFLOPS at a cost of $900K Canadian) is $0.75/MFLOPS. That's good, but it works out to $750/GFLOPS, well above the $100/GFLOPS mark. Back in 2000, KLAT2 came in at $640/GFLOPS for 32-bit precision Linpack; assuming McKenzie is quoting 64-bit performance, the comparable number for KLAT2 was $1807/GFLOPS. In summary, price/performance for McKenzie is nearly 2.5X better than KLAT2... but KASY0 is over 7.5X better than KLAT2 in single precision (and over 8.5X better in double).

Another interesting, if somewhat outdated, data point is Price/Performance in High-Performance Technical Computers and the 2Q00 update. These documents quote price/performance (for peak GFLOPS) as ranging from $6,300 to $49,000 per GFLOPS in the middle of 2000... obviously, they missed KLAT2 (which, in April 2000, had a 64/80-bit peak of 89.6 GFLOPS at a price/performance of $460/GFLOPS). More interesting is this cover quote: "For a fixed budget of $500,000 a user can purchase over 75 peak GFLOPS today and will be able to purchase over 600 peak GFLOPS in Mid 2005." KASY0 comes in at a peak of 531.2 GFLOPS (or 1,062.4 GFLOPS, 32-bit) for $39,454.31... no worse than $74.27/GFLOPS. Both KLAT2 and KASY0 are at least an order of magnitude better price/performance than the best they expected... and it's only Mid 2003. ;-)

As of Fall 2003, a third contender has been cited: the Terascale Cluster at Virginia Tech. That system comes in at no better than about 3X higher cost per unit performance than KASY0. Despite its significantly poorer price/performance, we are frankly very impressed by how good the price/performance of that cluster is given its use of a conventional network architecture implemented with InfiniBand. I'm sure we'll all be hearing a lot more about this cluster, the Apple G5, and InfiniBand....

As of Summer 2004, yet another contender has been cited: the General-Purpose Computation on GPUs group at SUNY Stony Brook. In the paper (to appear at SC2004, but widely publicized and available now on the WWW) GPU Cluster for High Performance Computing, at the end of section 3, the claim is made that "We therefore get in principle 41.1 Mflops peak/$." However, that number is actually computed by saying that the addition of 32 GPUs into their cluster would increase theoretical peak by 512 GFLOPS at a cost of $12,768... completely ignoring the cost of the cluster needed to host the GPUs! The actual system they built, minus some hardware they say is not used in their application, is quoted as having a theoretical peak of 832 GFLOPS for a cost of $136,000. Since these are 32-bit GFLOPS, the comparable numbers for KASY0 using 3DNow! are 1,062.4 GFLOPS for a cost of $39,454.31. Thus, the actual peak MFLOPS per dollar for their system is 6.12 and for KASY0 is 26.93. In summary, KASY0 has 4.4X better price/performance than their GPU-augmented cluster.
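If you want to check our arithmetic, the comparison in the previous paragraph is nothing fancier than this (figures exactly as quoted above):

#include <stdio.h>

/* Re-checking the peak-MFLOPS-per-dollar arithmetic, using the quoted
   32-bit theoretical peaks and system costs. */
int main(void)
{
    double gpu_cluster = 832.0e3  / 136000.0;   /* about 6.1 MFLOPS/$  */
    double kasy0       = 1062.4e3 / 39454.31;   /* about 26.9 MFLOPS/$ */

    printf("GPU cluster: %5.2f peak MFLOPS/$\n", gpu_cluster);
    printf("KASY0:       %5.2f peak MFLOPS/$\n", kasy0);
    printf("ratio:       %5.1fX\n", kasy0 / gpu_cluster);
    return 0;
}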

For what it is worth, we do believe that augmenting a cluster with GPUs is a reasonable approach, but the price/performance margin for GPUs isn't great. In addition to careful design of the cluster, the approach we are taking for use of GPUs reduces the cost overhead of host nodes by placing multiple GPUs in each host... which doesn't really work without multiple PCI Express slots per motherboard and some system software support, neither of which is really ready at this writing.

What are KASY0's SETI/other-grid-thingie numbers?

KASY0 has a bunch of fast processors that are connected to the Internet, so we can indeed run all those wonderful "useful screensavers that don't need a screen." In case you had not guessed, we don't consider those to be very good tests of a supercomputer's abilities. Being fast at "parallel" programs that virtually never communicate between nodes always has been easy.

Although we have occasionally used such applications as burn-in tests, we had more important things ready to run on KASY0.

Any rendering stuff on KASY0, e.g., POVRAY?

There is a new POVRAY benchmark that we will run and post results for here. The interesting issue is that the latest version of POVRAY isn't compatible with the POVRAY version patched to run in parallel using MPI, our preferred messaging software environment. The (older) PVM version oddly enough has been patched to work with the newest POVRAY. We will publish results either using the PVM version or a fixed MPI version.

At this writing, a preliminary run of PVM-POVRAY 3.5c with the new standard test image (benchmark.pov) renders in 72 seconds -- a new record fastest time as of August 22, 2003. Oddly, much of this benchmark's complexity is really just the parsing of the very complex scene description, not rendering per se. Worse still, parsing is not parallelized.

Is KASY0 really the first supercomputer under $100/GFLOPS?

To the best of our knowledge, KASY0's claim is unchallenged for a general-purpose supercomputer. The closest price/performance competitor is probably this $50,000+ system built by the National Center for Supercomputing Applications (NCSA) using 70 PlayStation2 units... which, using theoretical peak speed, would be about $110/GFLOPS using single-precision. In contrast, KASY0's theoretical peak speed is slightly more than 3X better, at just over $37/GFLOPS; using 64/80-bit precision, KASY0's peak is better than $75/GFLOPS. On a single-precision version of HPL, KASY0's measured performance yields better than $84/GFLOPS.

Although KASY0 was not complete until after the paper submission deadline for a 2003 Gordon Bell Award, we submitted a paper that described the technology that makes KASY0 so cost-effective and indicated in the paper that more detailed results would be added before the final paper was due. Unfortunately, the paper without the detailed results was rejected from SC03. It is very unfortunate that things done during the peak academic research time -- the summer -- are too late to be considered for that year's Gordon Bell Awards (which are given in November)....

Somebody reports that KASY0 ran "peppier" under Windows than under Linux?

It was brought to our attention that somebody said:

The researchers loaded WinXP on the thing and described its performance as "peppier."

We have never used Windows XP on KASY0 or any other cluster, nor have we seriously considered doing so. In fact, it isn't clear how efficiently any version of Windows could be made to use KASY0's SFNN and we certainly have not invested any effort in writing SFNN drivers for Windows. If Microsoft would like to pay for us to look at that, we would be willing, but nobody has ever told us that they want Windows support for FNNs or SFNNs -- people have always wanted support for some version of unix (although not always Linux).

RedHat Linux 9 is a bit bloated, but we believe that it still is far more suitable as a cluster supercomputer node OS than any version of Windows: it is easier to strip down. There are other unix variants that also are quite viable. Where Windows has an advantage is primarily in running interactive desktop applications that Microsoft sells to operate on their own proprietary file formats... and that advantage is significant, but not relevant to cluster supercomputing. Microsoft also has done a pretty good job in providing software management facilities for large numbers of PCs, although a node in a well-designed cluster isn't exactly a PC (despite being made from PC parts). In summary, if Microsoft would like to be able to use SFNN technology, we're happy to help them, but nobody has offered to pay for us to make Windows support happen.

What's inside KASY0?

KASY0's configuration is:

  * 128 diskless nodes, each built around an AMD Athlon XP 2600+ (Thoroughbred core) with 512MB of PC2700 DDR memory
  * three 100Mb/s Fast Ethernet NICs per node, wired as a Sparse Flat Neighborhood Network through 17 switches
  * a boot server built from the same parts as a node, plus two more NICs, another 512MB DIMM, and a 120 GB IDE hard disk
  * a total parts cost of $39,454.31

For details see our cost table.

What are the distinguishing features of KASY0?

Several aspects of KASY0 are new:

  * it is built around the first Sparse Flat Neighborhood Network (SFNN)
  * to the best of our knowledge, it is the first general-purpose supercomputer to come in under $100/GFLOPS
  * it runs our new routing software, a major generalization of channel bonding that handles channel-bonded networks, FNNs, and SFNNs

Depending on when we build our next system, there may be additions to that list....

Why did you use Athlons?

Most people know that we are very fond of Athlons; we were the first people to publicly display a cluster using them. We love the floating point performance, 3DNow!, the highly-effective hardware code scheduling, and, most significantly, the price/performance. A supercomputer is a design that converts computer price/performance into raw performance.

Why not use Opterons? Well, we are very impressed by Opterons (especially the new memory pipeline), but being impressed doesn't mean Opterons yield the best price/performance. An Opteron and an Athlon have essentially the same core arithmetic performance at the same clock rate, and Athlons still have a faster clock and a lower price. Running code that is not well-tuned for cache usage, the Opteron would clearly win... but our codes are very carefully tuned to get near peak performance, which includes very careful cache tuning. We expect the Athlon 64 processors to have much more favorable price/performance for use in clusters.

The other little surprise is that KASY0's Athlons are the "old" (Thoroughbred) version of the XP, not the latest (Barton) version. Why? AMD's performance ratings are based on typical system performance on a wide range of codes, including codes that are not tuned for good performance. Thus, when AMD made changes like adding more cache to Barton, they gave the chips higher performance ratings at lower clock rates. For highly-tuned codes, a Thoroughbred Athlon XP at a given performance rating has a higher core clock than a Barton, yielding higher actual performance, especially for floating-point arithmetic. The faster-clock Thoroughbreds are generally cheaper than the Bartons because "typical" application performance really does favor the larger caches... which means supercomputing folks get a bargain price for the processors that are fastest for well-tuned supercomputer applications.

Even though the Thoroughbred AMD Athlon 2600+ processors cost only about $100 each, they were more than 35% of the total system cost! Perhaps the best thing about the new SFNN technology is that, using it, you get to spend a much larger than normal fraction of your budget on the processors.

As of August 2004 -- a year after KASY0 was built -- an interesting little additional note is appropriate. AMD is phasing out the Barton-based Athlon XP processors and is essentially replacing them with the new Sempron family. What makes that so interesting is the fact that the Sempron 2400+, 2500+, 2600+, and 2800+ are in fact using the Thoroughbred Athlon XP core. To be precise, the processors used in KASY0 are back in production under the Sempron 2800+ name with a list price of $122 (as of August 20, 2004). This rather strange turn of events actually makes a lot of sense: the Barton's lower clock and bigger die (thus, higher cost) make it inferior to Thoroughbred for things that can live in cache, while the new memory pipeline in the AMD64 line blows away the Barton for things that can't live in cache. One could get upset about the part number inflation in the Semprons -- 2600+ became 2800+ -- but a rose by any other number.... ;-)

Shouldn't each node have more memory?

Of course they should. ;-) However, 512MB each is a reasonable number for now and easy for us to upgrade later.

How does KASY0 boot without any disks?

The built-in network interface on the motherboard supports PXE network booting. Unfortunately, the BIOS settings had to be changed from the default to enable this feature. Since we had used Etherboot on KLAT2, it was simpler to use a two-stage boot: first PXE, then Etherboot. The Etherboot bootloader then loads a Linux kernel and a compressed ramdisk image set up with the Warewulf package. The boot server for KASY0 is built with the same components as a node, with the addition of two more NICs, another 512MB PC2700 DIMM (1 GB total), and a 120 GB IDE hard disk.

How do you do I/O to hard disks, video, etc.?

Since our current research work does not require local disks, KASY0 nodes are completely diskless; there isn't even a floppy. Disk space is "borrowed" from one of several servers or from another cluster (also in our lab) that has a disk on each node. The same procedure is used for video, using either a workstation or a cluster video wall to display things from KASY0. All of this can be done via relatively "thick" connections behind our lab firewall.

Addition of disk drives to all nodes will be done as an upgrade if we find our applications need local disks, but we have no plans to add video displays to KASY0: that's better done with a separate video wall cluster.

Why didn't you use gigabit network hardware?

We did seriously consider building KASY0 using Gigabit Ethernet, most likely 8-port switches. In some sense, it would have been an even better showcase of SFNN technology, because the switch width would have been an even smaller fraction of the node count. There were three main reasons we used 100Mb/s Fast Ethernet:

  1. Although cheap Gigabit Ethernet switches were theoretically available, backorders pushed delivery dates too late.
  2. For reasons that we do not fully understand, Gigabit Ethernet hardware has been observed to yield slightly higher latency than 100Mb/s Fast Ethernet. The higher link bandwidth doesn't really yield much higher total system (bisection) bandwidth in an SFNN configuration. In summary, our conclusion was that Gigabit Ethernet would not yield enough additional network performance over 100Mb/s Fast Ethernet to compensate for its higher cost per node. We expect that to change soon....
  3. Available budget would have limited us to 64 nodes thanks to the higher cost of Gigabit Ethernet NICs.

Options such as Myricom's Myrinet and Dolphin's Wulfkit are not really options -- both use NICs that cost more than 3X what an entire node in KASY0 costs (even including 1/128th of the network in the cost of each KASY0 node).

Why didn't you use a different network topology?

The whole reason we built KASY0 was to demonstrate and experimentally evaluate the new Sparse Flat Neighborhood Network (SFNN) concept. Besides that, nothing else comes close to an SFNN's performance at anything approaching its low cost.

What is the difference between an FNN and an SFNN?

An SFNN is a Sparse FNN, which makes it somewhat cheaper and much more scalable than an FNN.

Fundamentally, an FNN is a network that ensures all possible pairs of nodes can communicate with single-switch latency. An SFNN only ensures single-switch latency for a specific set of node pairs. The set is generally specified as all pairings that occur in any of a given set of communication patterns. The interesting thing is that, even taking the union of all the communication patterns commonly used in parallel processing, the number of node pairs doesn't grow as fast as O(N*N) -- so the network complexity increases relatively slowly.

Node pairs that were not expected to communicate still can communicate with good bandwidth in an SFNN, but they might see higher latency.
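To give a feel for how slowly the required pair count grows, here is a toy sketch (emphatically not our real design tool) that counts the distinct pairs needed by just a ring and a hypercube pattern on 128 nodes; it finds 512 pairs out of the 8,128 possible:

#include <stdio.h>

#define N 128   /* KASY0's node count */

static unsigned char need[N][N];   /* need[a][b] set if pair must be fast */

static void pair(int a, int b)
{
    if (a == b) return;
    if (a > b) { int t = a; a = b; b = t; }
    need[a][b] = 1;
}

/* Toy illustration: union the pairs needed by two common patterns and
   compare against all N*(N-1)/2 possible pairs. */
int main(void)
{
    int count = 0;

    for (int i = 0; i < N; i++) {
        pair(i, (i + 1) % N);              /* 1D torus, +/-1 offsets */
        for (int d = 1; d < N; d <<= 1)
            pair(i, i ^ d);                /* hypercube partners     */
    }
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            count += need[i][j];

    printf("pairs needing single-switch latency: %d of %d\n",
           count, N * (N - 1) / 2);        /* 512 of 8128 here */
    return 0;
}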

The communication patterns specified, and KASY0's resulting network design, are given in the following output from our SFNN design tool:

./KASY0 100 02010770 128 3 17 23
Seed was   1058580131
Pattern Generated: 1D torus +/- 1      offsets
Pattern Generated: 2D torus all of row and col ( 16,  8)
Pattern Generated: 3D torus all of row and col (  8,  4,  4)
Pattern Generated: bit-reversal
Pattern Generated: hypercube
PairCt =   1536, Min = 23, Avg =  24.00, Max = 25
 
Hit Control-D if there are no network configurations to load.
 
  1 network configuration(s) pre-loaded.  Generating 4095 more.
 
Starting full genetic search.
At gen start, net[bestpos].hist[0][0] = 4671
New best at generation         1:
  0:   0  15  16  31  32  47  48  63  64  79  80  95  96 111 112 120 121 122 123 124 125 126 127
  1:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  24  40  56  72  88 104 120
  2:   7  23  39  55  71  87 103 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
  3:   2  10  18  26  33  34  35  37  42  43  44  45  46  50  58  66  74  82  90  98 106 114 122
  4:   5  13  21  29  37  45  53  61  69  77  81  83  84  85  86  89  92  93  94 101 109 117 125
  5:   1   9  17  25  33  41  49  57  64  65  67  68  70  73  75  76  78  81  89  97 105 113 121
  6:   4  12  16  18  19  20  22  23  27  28  30  36  44  52  60  68  76  84  92 100 108 116 124
  7:   3  11  19  27  35  43  51  59  67  75  83  91  96  98  99 102 103 104 107 109 110 115 123
  8:   6  14  22  30  38  46  48  50  53  54  55  56  59  60  62  70  78  86  94 102 110 118 126
  9:  32  36  38  39  40  41  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
 10:   2  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  69  71  72  73  74  79
 11:  64  65  66  67  68  69  70  71  72  74  75  76  77  78  79  80  82  87  88  90  91  93  95
 12:   5  17  20  21  24  26  29  31  80  81  82  83  84  85  86  87  88  89  90  91  92  94  95
 13:   7   8  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31 100 101 105 111
 14:  66  73  77  85  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 117 119
 15:  49  51  52  54  57  58  61  62  63  65  93  97  99 106 107 108 113 114 115 116 118 119 127
 16:   0   1   3   4   6   8   9  10  11  12  13  14  25  28  34  42
 17.
NICs/PC: 0[     0] 1[     0] 2[     0] 3[   128]
Dist[0]: 0[  4671] 1[  2769] 2[   665] 3[    23]
Dist[1]: 0[     0] 1[   957] 2[   556] 3[    23]
Above has Quality  99023211

The lines starting with a number followed by ":" specify the actual wiring pattern; the first number is the switch number, the remaining numbers on each line are the node numbers connected to that switch.
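If you want to play with that output, here is a little throwaway C filter (not part of our tools) that reads the numbered wiring lines on stdin and counts how many switch ports, i.e. NICs, each node uses. For the wiring above it should report 3 for every node, matching the "NICs/PC" summary line:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXNODE 128

/* Feed this just the "switch: node node ..." wiring lines. */
int main(void)
{
    static int nics[MAXNODE];
    char line[4096];

    while (fgets(line, sizeof line, stdin)) {
        char *p = strchr(line, ':');
        if (!p) continue;                    /* skip non-wiring lines */
        for (p++; *p; ) {
            char *end;
            long node = strtol(p, &end, 10);
            if (end == p) break;             /* no more numbers */
            if (node >= 0 && node < MAXNODE) nics[node]++;
            p = end;
        }
    }
    for (int n = 0; n < MAXNODE; n++)
        printf("node %3d: %d NIC(s)\n", n, nics[n]);
    return 0;
}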

How does SFNN routing really work?

Our latest software, which can be viewed as a major improvement and generalization of "channel bonding," can be used for channel-bonded networks, FNNs, and SFNNs. Basically, every NIC in every PC has a unique (private network) address which is set by our software to facilitate routing of messages within the cluster. We plan to release details in our research exhibit at IEEE/ACM SC2003 in November 2003.
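We have not published the implementation details yet, but the core idea is easy to sketch. The following toy C fragment (with a made-up four-node wiring, nothing like KASY0's real one) shows the basic lookup: to talk to a destination node, pick a local NIC that sits on a switch the destination is also plugged into:

#include <stdio.h>

#define NICS_PER_NODE 3
#define NODES 4   /* tiny made-up wiring for illustration only */

/* switch_of[node][nic] = which switch that NIC is plugged into */
static const int switch_of[NODES][NICS_PER_NODE] = {
    {0, 1, 2}, {0, 1, 3}, {0, 2, 3}, {1, 2, 3}
};

/* Return the local NIC index to use for dst, or -1 if no shared switch. */
static int pick_nic(int src, int dst)
{
    for (int i = 0; i < NICS_PER_NODE; i++)
        for (int j = 0; j < NICS_PER_NODE; j++)
            if (switch_of[src][i] == switch_of[dst][j])
                return i;
    return -1;
}

int main(void)
{
    printf("node 0 -> node 3: use NIC %d\n", pick_nic(0, 3));
    return 0;
}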

What about your choice of NICs?

Previously, we deliberately used Realtek chipset NICs because, even though they were known for poor performance under Linux, they were obviously becoming the standard. For KASY0, we decided to use NICs known to give good performance under Linux: a "Tulip" chipset clone.

We do get slightly higher performance from the new NICs. However, about 1/3 of the new NICs drop a packet roughly every 10,000 packets. We don't yet know why. A few of the Realtek NICs had much worse problems and we simply discarded those NICs (we purchased quite a few spares), but we cannot discard 1/3 of the new NICs and the problem doesn't seem to be bad enough to call the offending cards "broken" and return them to the vendor.

Where are the keyboards, video cards, mice, etc.?

Elsewhere.

KASY0 is a dedicated computational cluster. Each node is not a complete PC/workstation. That said, it does take a little work to get your typical PC to boot without a keyboard or video card. Usually, all that is required is to change some settings in the BIOS to ignore the "keyboard errors", etc. These were configured when we did the initial power-on test for each node as they were assembled -- at which time each actually did have a video card and keyboard.

How do you power and cool KASY0?

With 210A of 120VAC and a 5-ton air conditioner. Actually, the operational power cost of KASY0 is quite significant in comparison to its build cost: it takes about 10% of the build cost to pay for one year of powering KASY0. In fact, one year of power is about the same cost as KASY0's network!

The physical layout (a degenerate spiral of racks that looks like a "squashed" circle) not only minimizes wire lengths but also simplifies cooling. Cold air comes in above the racks, is pushed down through the center of the system by two box fans, and diffuses out, forcing hot air away from the system. Unfortunately, the lab containing KASY0 has no windows, so the only place for the heat to go is through the air conditioning heat exchanger.

Do I see two empty holes on the side of the cases where fans should be?

Sort-of. We removed the side-vent fans (most would be blocked by the cases next to them anyway). However, we then stacked them and put them in the back of the case they came from as a redundant rear exhaust. The racking of the cases is designed so that rear exhausting causes the right air circulation patterns. We're still pondering what to do with the 264 wire thingies that were used to keep folks from sticking their fingers in the fans when they were side-mounted. ;-)

Maybe UK can have students work for pizza, but how much would it cost if you paid people to assemble KASY0?

Although we had a lot of student help (over 50 people total), the point of that was getting lots of people involved in the technology, not really that we needed tons of cheap labor. After all, we are a University -- we're supposed to be inspiring and teaching students. :-) Aside from that, building your own systems implies higher shipping costs, and this cost alone pretty much cancels any cost savings from assembling your own.

That said, keep in mind that our group has built LOTS of clusters since February 1994. If you are building your first, it will take you longer. There are a lot of tricks we've learned. Even stupid things like assembling the shelving units can take hours if you get the wrong kind of shelving; having wheels on the shelves also saved us more than a few hours because things didn't have to be assembled in place.

What am I looking at in that photo of KASY0?

Which photo? How about this one:

Ok. What you see is 6 standard wire shelving units arranged in a degenerate-spiral/squashed circle with a couple of shelves in the center linking the shelving units to form a "wiring tray" and to structurally lock the opposing shelving units in place. Above the wiring tray are mounted a sign saying "KASY0" and two downward-facing box fans that slightly pressurize the core of the system with cold air blown along the ceiling by the room's air conditioner. The cold air thus flows down through the core and diffuses out.

The heavy cables coming from above 5 of the shelving units are the power drops. Yes, there are only 5, not 6: the 6th rack gets its power equally divided from the drops going to the 5 other racks. The racks are color-coded; the rack whose power comes from drops on other racks is colored white; you can clearly see the blue and yellow rack colors on the power patch-panels on the left. The other three racks are colored red, black, and green. Switches are also color-coded and 3 are placed on the inside edge of each rack such that the switch colors on each rack follow the rack color as a theme. For example, the blue rack is home to the blue, marked-blue, and purple switches -- and each cable is colored like the switch it is connected to. Thus, cables going to switches on other racks have colors that stand out from those staying within that rack.

The orange cables coming off the front left are connections to a server and the outside world. We have already re-routed them so that they are less obtrusive... by placing the primary server on the outside bottom edge of the yellow rack (which is an empty slot in the above photo).


The Aggregate. The only thing set in stone is our name.