As if you didn't know, this document is always under construction....
Don't see your question above? You can send email to
Maybe we'll even answer. ;-) (remove the NOSPAM for a valid e-mail address)
KLAT2 stands for Kentucky Linux Athlon Testbed 2, rather obviously the second Linux cluster we built at the University of Kentucky, Lexington, KY using AMD Athlon processors.
Gort and Klaatu (KLAT2 not pictured)
Yes, KLAT2 also is an obscure reference to Klaatu, the fellow from outer space in the classic 1951 science fiction movie The Day The Earth Stood Still. In the film, Klaatu comes, with the robot Gort, to explain to all the people of the earth that if humans cannot work together in peace, the earth will be destroyed for the good of all planets. We like the analogy.
Yes. Every hardware component is an unmodified commodity subsystem, available from multiple vendors. Although KLAT2 runs MPI applications unmodified, some of our system software is odd (but also free). For information on other "Beowulfs" see http://www.beowulf.org/, http://www.beowulf-underground.org/, and the Parallel Processing HOWTO. For an answer to "What's a Beowulf?" see the first entry in the Beowulf FAQ.
Later upgrades to KLAT2 will add a few "custom" components, but we didn't need any of them to get the performance on the applications that we've used thus far. Our custom hardware primarily expands the range of applications that can get good performance; KLAT2 works very well without the custom stuff for things that already worked on other clusters.
It really gets over 64 GFLOPS on 32-bit ScaLAPACK. Using an "untuned" 80/64-bit version, KLAT2 gets a very respectable 22.8 GFLOPS. These aren't theoretical numbers, they are the real thing. The theoretical we-will-never-see-that numbers are 179 and 89 GFLOPS, respectively, for 32-bit and 80/64-bit floating point.
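To put those numbers in perspective, the achieved-to-peak ratios are easy to check. A quick sanity-check sketch using only the figures quoted above (variable names are ours, chosen for illustration):

```python
# Rough efficiency of KLAT2's measured ScaLAPACK numbers vs. theoretical peak.
# All figures come from the text above; percentages are rounded.
peak_32, achieved_32 = 179.0, 64.0   # GFLOPS, 32-bit 3DNow!
peak_64, achieved_64 = 89.0, 22.8    # GFLOPS, 80/64-bit floating point

print(f"32-bit efficiency:    {achieved_32 / peak_32:.0%}")  # ~36%
print(f"80/64-bit efficiency: {achieved_64 / peak_64:.0%}")  # ~26%
```

Roughly a third of peak on real ScaLAPACK runs is quite good for a commodity cluster; the theoretical numbers really are "we-will-never-see-that."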
Yes, we know ScaLAPACK is only one application and not a very general one at that. We have other stuff running as well. In fact, we submitted an entry for a Gordon Bell price/performance prize based on running a complete CFD package on KLAT2. The only code in common between ScaLAPACK and the CFD package is the LAM MPI library that we modified to understand KLAT2's FNN.
KLAT2 has a bunch of fast processors that are connected to the Internet (by a lousy 10Mb/s connection through a firewall), so we can indeed run all those wonderful "useful screensavers that don't need a screen." In case you had not guessed, we don't consider those to be very good tests of a supercomputer's abilities. Being fast at "parallel" programs that virtually never communicate between nodes always has been easy.
Although we in no way wish to suggest that it is a good use of a supercomputer's time, yes, we have let KLAT2 work on www.distributed.net between doing "serious" work. KLAT2 handles over 150M RC5 keys/s. Our Athlons, KLAT2's 66 700MHz + Odie's 5 600MHz, rank something like number 26-37 in the daily list (when we let them run): see the entry for email@example.com.
We haven't run SETI@home, etc. The only reason we ran the RC5 stuff is that we had played with that before, so it was a no-brainer to let it run. None of this stuff is what we built KLAT2 to do.
We're looking at some of this stuff... using KLAT2 as a render engine for the video wall driven by Opus would be really cool.... Thus far, all we have is the single-processor POVBench number of 31 seconds for one of our 700MHz Athlons. We anticipate at least trying MPI-Povray (or another version) on KLAT2 in the near future.
Yes... with some disclaimers. There have been computers under $1K/GFLOPS, but they were either not scalable or achieved their performance only on a very specific customized code. At the time that we made KLAT2's press release, we were pretty certain that KLAT2 was the first scalable, general-purpose, supercomputer to pass the $1K/GFLOPS mark.
In a later WWW search, we stumbled upon another machine that seemed in many ways similar to KLAT2, so we contacted its builders by email to share insights and discuss potential future collaboration. That's when they asserted their claim. Apparently, KLAT2 and Bunyip (from the Australian National University, Canberra) passed the $1K/GFLOPS mark at virtually the same time. After compensating for the fact that Australia's clocks run many hours ahead of the USA's, Bunyip seems to be documented as crossing the line hours before KLAT2. However, we didn't document KLAT2's performance until after we hit a performance plateau significantly past the $1K/GFLOPS mark, so we don't honestly know precisely when we passed the mark. The waters are further muddied by the fact that while KLAT2 passed the mark by a large margin, Bunyip passed it by about 2% (depending on the precise currency exchange rate). Also, judging by the ratio to uniprocessor peak, Bunyip probably wouldn't be under $1K/GFLOPS solving the standard benchmark problem that we used on KLAT2 -- Bunyip's performance is quoted running a customized application. There is even the issue that, although Bunyip is larger than KLAT2, KLAT2's design is scalable to much larger systems than Bunyip's without going above $1K/GFLOPS.
Given all the above, I think it is most fair for KLAT2 and Bunyip to share the official credit for breaking the $1K/GFLOPS barrier.
(In case you were wondering, because I was, Bunyip takes its name from monsters that Aboriginal legends say live around water, make loud noises, and eat people. IMHO, not a real user-friendly name for a cluster. ;-)
Yup. Although KLAT2 can run many codes well, we plan to create a series of customized cluster designs and software for a variety of specific applications in science and engineering -- PeTS (Personalized Turnkey Superclusters). For various reasons, CFD (Computational Fluid Dynamics) will be the first PeTS target, and we have used KLAT2 for prototyping that system. Although CFD is not an easy application for a cluster, KLAT2's performance is good enough that it is currently a finalist for a Gordon Bell Price/Performance award.
The paper describing what we did is: Thomas Hauser, Timothy I. Mattox, Raymond P. LeBeau, Henry G. Dietz and P. George Huang, "High-Cost CFD on a Low-Cost Cluster," Gordon Bell Price/Performance Finalist and regular paper in SC2000, Dallas, Texas, USA, November 4-10, 2000. Preprints are available as 13MB PS and 31MB PDF versions for personal use only. (There also is a 4MB PDF version that some PDF viewers don't like.)
Although several upgrades are planned, KLAT2's initial configuration (the one that it used for all the benchmarks quoted above) is:
For details see our cost table.
Several aspects of KLAT2 are new:
Several planned upgrades will soon add to that list....
To put it simply, we use Athlons because they give us the best price/performance. For our software technology, AMD took the lead over Intel as soon as AMD came out with the K6-2 300MHz -- the 3DNow! extensions had no competition from Intel or anybody else. The Athlons are not only faster than the K6-2 and K6-3, they're also much more tolerant of less-than-ideal code sequences.
Although carefully-coded Pentium III SSE should be competitive with 3DNow!, it is a more difficult target for our compiler technology, and Intel has not been as supportive of our compiler development work as AMD has been. The key benefit that Intel offers is the ability to have more than one processor in a shared-memory PC, which wasn't enough of a benefit to surpass the better support we have for 3DNow! Actually, the only other system we know of that comes close to KLAT2's price/performance uses dual Pentium III nodes: Bunyip, from the Australian National University, Canberra. They spent months hand-tweaking SSE code for the SGEMM routine; it took us a few days with 3DNow!
Despite the wonderful ads with the tanks, the PowerPC AltiVec (aka Mac G4) was not the first PC to reach 1GFLOPS; the K6-2 300MHz was. Similarly, although the G4 in theory can do much better than the Athlon, the system prices are way too high to achieve comparable price/performance. AltiVec is very nice and the G4 also uses general registers for conventional floating point, which gives it a big edge over the IA32 stack model in terms of ILP (Instruction-Level Parallelism within the CPU).
What about the DEC, er, ah, Compaq Alpha? Too pricey for too little extra performance. In fact, on ScaLAPACK, LANL's Avalon cluster with 68 533MHz Alphas gets 19.33 GFLOPS using 64-bit doubles; KLAT2's 64 700MHz Athlons easily get 22.89 using 80/64-bit doubles. Also, the Alphas don't get much faster using 32-bit singles, whereas the Athlons zip to over 64 GFLOPS. Other "workstation" processors don't even come that close... but there's always next year. :-)
BTW, yes, AMD did donate the processors to us. However, that has nothing to do with the above -- I probably could have gotten Intel to donate processors too. In fact, Intel has a history of being more generous in supporting my group's cluster work than AMD has been. I hope to continue working with everybody; we're just trying to develop and freely disseminate new technology.
Of course they should. ;-) The problem is that fast SDRAM isn't cheap enough to add more without good reason, and our initial applications did not need more than 128MB per processor. Actually, we'd probably want 256MB per processor if we were tuning for 80/64-bit performance rather than 32-bit 3DNow! performance. In any case, we are very likely to add more memory as funding permits; each PC can hold 3/4GB.
Well, each PC was originally going to have a 20GB disk, but we didn't need it for our initial applications, so.... Also, it really pays to wait before purchasing disk drives. Again, budget permitting, we plan to add at least a TB of disk.
It also is worthwhile noting that KLAT2 is sitting in a lab with other clusters that have disks on each node -- it is pretty easy for KLAT2 to use another cluster as an "I/O partition." We do the same thing for visual output, using Opus to drive a 6,400x4,800 pixel video wall. The other clusters are connected through the many spare ports of KLAT2's uplink switch, so network performance between clusters can be quite good. There is at least 1.8Gb/s of total bandwidth to the cluster nodes available through the uplink switch, and the full 800Mb/s provided by 4 NICs is available for I/O with any individual node.
Ordinary terminal access, etc., also is done from other machines via the uplink switch. Although the two "hot spares" in KLAT2 can be viewed as front ends for the cluster, we really do not use them that way. The cluster nodes are treated completely symmetrically, and parallel jobs can be initiated via remote access to any node. That said, all our clusters are behind a dedicated firewall that isolates the KAOS lab from all the "baddies" out there on the internet.
Simply put, we originally intended to use Gb/s NICs and switches, but they are not yet cost effective in comparison to our Flat Neighborhood Network design implemented using 100Mb/s Ethernets. You could argue that the lower latency of things like Myrinet is worth it, but KLAT2 will soon have a secondary network based on the PAPERS hardware we've been using since February 1994; that hardware has far lower latency than Myrinet and is dirt cheap (because we build it ourselves).
Keep in mind, we don't mind rolling up our sleeves and getting into some "bleeding edge" technology. If you don't want to do that, and you're not "budgetarily challenged," I actually do recommend Gb/s hardware over our 100Mb/s FNNs.
Take a look at our page on FNNs. Basically, you can do all sorts of topologies ranging from switchless node interconnect patterns to massive switch hierarchies, but each topology has both advantages and disadvantages. Switchless networks (e.g., subclasses of toroidal hyper-meshes) have good bisection bandwidth (limited by sharing of routes through PC nodes) and minimal cost, but terrible latency for any path through one or more PCs. Other switch topologies end up being fabrics that, while providing lower latency than hyper-meshes, still incur the latency of several switches on typical communications and also "waste" a lot of switch bandwidth within the fabric. FNNs minimize latency and maximize the contribution of each switch to bisection bandwidth. The primary disadvantages of FNNs involve their design complexity... which we've solved by automating the process.
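The defining property of an FNN is that every pair of PCs shares at least one switch, so any two nodes are a single switch hop apart. A minimal sketch of checking that property, using a made-up 4-node, 2-NIC, 3-switch wiring (not KLAT2's actual wiring, which our automated design process produces):

```python
from itertools import combinations

# Hypothetical FNN wiring: node -> set of switches its NICs plug into.
# This toy assignment is for illustration only.
wiring = {
    "n0": {"swA", "swB"},
    "n1": {"swA", "swC"},
    "n2": {"swB", "swC"},
    "n3": {"swA", "swB"},
}

def is_flat_neighborhood(wiring):
    """True iff every pair of nodes shares at least one switch."""
    return all(wiring[a] & wiring[b]
               for a, b in combinations(wiring, 2))

print(is_flat_neighborhood(wiring))  # True: every pair is one switch hop apart
```

The hard part, of course, is finding an assignment with this property that also balances expected traffic across the switches -- that is the design complexity our automated tools handle.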
Every NIC in every PC has a unique MAC address (and potentially unique IP address) -- i.e., this is not channel bonding. Given this arrangement, there are two "levels" of intelligence with which FNN message routing can be accomplished.
Basic Routing is trivially implemented by forcing each machine to have its own (possibly unique) routing table. Thus, when machine "A" looks up machine "B", it may find that "B" has a different MAC address than what machine "C" thinks is "B". This same principle works for IP addresses, MPI nodes, etc. For example, IP works by using the host file (/etc/hosts) and a pre-loaded ARP cache (some unix-version-dependent code here) to implement each machine's routing table. MPI works because it is built on top of the IP mechanism. All KLAT2's initial performance numbers have been achieved using only basic routing at the IP layer or above.
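The per-machine view can be sketched as follows: each host resolves a peer's name to the IP address of whichever of the peer's NICs sits on a switch they share. All wiring and addresses below are invented for illustration:

```python
# Sketch of FNN "basic routing": each host gets its own hosts-file-style
# view, resolving a peer name to the peer NIC reachable via a shared switch.
wiring = {  # node -> {switch: IP address of the NIC on that switch}
    "n0": {"swA": "10.0.1.10", "swB": "10.0.2.10"},
    "n1": {"swA": "10.0.1.11", "swC": "10.0.3.11"},
    "n2": {"swB": "10.0.2.12", "swC": "10.0.3.12"},
}

def hosts_table(me):
    """Build 'me's /etc/hosts-style view: peer name -> reachable IP."""
    table = {}
    for peer, nics in wiring.items():
        if peer == me:
            continue
        shared = set(wiring[me]) & set(nics)  # switches we both touch
        sw = sorted(shared)[0]                # FNN guarantees at least one
        table[peer] = nics[sw]
    return table

print(hosts_table("n0"))  # n0 reaches n2 via swB (10.0.2.12)
print(hosts_table("n1"))  # n1 reaches n2 via swC -- a *different* IP for n2
```

Note that "n2" resolves to different addresses on different hosts; that is exactly the "machine A's view of B differs from machine C's view of B" behavior described above.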
Advanced Routing is not trivial. The difference is simply that basic routing always uses one path between a particular pair of PCs, whereas advanced routing may use several paths and/or dynamically change which paths are used. Using multiple paths is easy -- but is not yet built into MPI or IP routing. Currently, that would be done at the user application level, but we intend to migrate it lower in the near future. Dynamically changing which paths are used is significantly more difficult at the user level, but we are working on "communication compiler" technology that would do sequences of parallel communications optimally using both multiple paths and path changes. It is unlikely that we will ever move dynamic path selection into a lower level, although we might use a very simple version of it for some degree of fault tolerance.
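At its simplest, the multiple-paths case amounts to striping a large message across every switch the two hosts share. A toy sketch (chunking and names are our invention, not our actual implementation):

```python
# Toy sketch of advanced routing's "multiple paths": stripe a message
# across all the switches two FNN hosts share, instead of using just one.
def stripe(message, paths):
    """Round-robin the message's chunks over the available paths."""
    chunk = 1536  # roughly one Ethernet frame's worth per chunk
    pieces = [message[i:i + chunk] for i in range(0, len(message), chunk)]
    plan = {p: [] for p in paths}
    for seq, piece in enumerate(pieces):
        plan[paths[seq % len(paths)]].append((seq, piece))
    return plan  # the receiver reassembles by sequence number

plan = stripe(b"x" * 5000, paths=["swA", "swB"])
print({p: [seq for seq, _ in pieces] for p, pieces in plan.items()})
```

The hard parts that make advanced routing non-trivial are all omitted here: reassembly, congestion awareness, and deciding dynamically which paths to use for a whole sequence of communications.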
Well, sort of. We do have some second thoughts.
A lot of people have pointed out that the NICs we use have a "notoriously slow" chipset, but that's not quite true. The RealTek chipset in our NICs is definitely one of the cheapest and most commonly available; it also is true that the Linux IP driver for it requires an extra buffer copy. However, it isn't really the chipset's problem -- it is caused by an alignment mismatch between Linux buffer allocation and what the chipset expects. Since we knew that we would be developing our own low-level drivers (much like Gamma), and the famous Tulip chipset has become rarer/pricey, we figured the RealTek chipset was a better target for new development. I'm sure our IP-based communication is getting less performance than it might with a Tulip-based NIC, but it wasn't enough difference to show up in any of our tests thus far (probably because of switch pipelining, see below).
The switches are another matter. Most people seem to think that a cheap switch must have lower bandwidth; this isn't always the case. Actually, the architecture of modern switches is not as powerful as one might expect; most switches use switch modules connected by a ring or ring-of-rings, with true wire-speed only within a module or across few enough modules so that the ring bandwidth suffices. People also complain about the fact that our switches use store-and-forward rather than cut-through routing, so latency is higher for large packets. However, latency on small Ethernet packets is not very different for store-and-forward and cut-through because both still have to buffer the header; large messages (e.g., typical PVM or MPI messages) get broken into multiple 1536-byte Ethernet packets and the cheap switches generally have enough buffer memory to allow quite a bit of pipelined overlap with the NIC, so large messages also end up with about the same latency you'd see with cut-through switches. Actually, we care very deeply about latency and have been making a lot of noise about that since 1994, but Ethernet (and even Myrinet, etc.) are inherently slow by our standards -- soon, KLAT2 will have a secondary network that has about 3us latency (a new version of our PAPERS hardware).
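The store-and-forward vs. cut-through point is easy to check with back-of-the-envelope arithmetic, using the frame size quoted above and Fast Ethernet's wire speed:

```python
# Back-of-the-envelope: store-and-forward penalty per switch hop at 100 Mb/s.
# The whole frame must arrive at the switch before forwarding can begin.
frame_bytes = 1536   # full-size Ethernet packet, as quoted above
link_bps = 100e6     # Fast Ethernet wire speed

serialize_us = frame_bytes * 8 / link_bps * 1e6
print(f"one-frame store-and-forward delay: {serialize_us:.0f} us per hop")
```

That one-frame delay (about 123 microseconds) is paid once per hop, not once per frame: when a multi-frame message streams through a switch with adequate buffer memory, later frames are received while earlier ones are being forwarded, which is the pipelined overlap described above.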
Our real problem with cheap switches isn't performance, but reliability. The particular Smartlink 31+uplink switches that we bought have a design flaw that causes a stunningly terrible 60% failure rate within a month or so. Smartlink is replacing the switches with a new design, but in retrospect we would have been better off going with the much more common 23+uplink switches. In fact, that was our original intent and it is even slightly cheaper. Oh well.
KLAT2 is a dedicated computational cluster. Each node is not a complete PC/workstation. That said, it does take a little work to get your typical PC to boot without a keyboard or video card. Usually, all that is required is to change some settings in the BIOS to ignore the "keyboard errors", etc. On the FIC SD11 motherboards in KLAT2, we had to enable "fast boot" for it to ignore the missing keyboard, and had to set the "Primary Display" to "absent" for it to boot without a video card. These were configured when we did the initial power-on test for each node as they were assembled -- at which time each actually did have a video card and keyboard. Hint: do one node all the way first, so you don't have to go back and adjust some BIOS setting on each machine a second/third time!
We've had help from many companies. In general, major support has come to us only through long-term relationships based on development of new technologies. AMD has a long history with us rooted in the fact that we do systems work that can be of direct value to them. Unless that's true of your work also, don't expect freebies.
If you are looking for a cheap source of PC hardware, there are a number of easier paths:
Obviously, writing proposals is the best solution for us academics in general. Helps with tenure, etc. ;-)
Well, if there wasn't air conditioning, the room would definitely get warm. We haven't yet measured the precise heat or power loads (KLAT2's nodes will be getting significant upgrades soon, so we've allowed for them). However, take a close look at the photo of KLAT2. Notice that big metal duct in the upper left?
Well, our lab's upgrade gave us enough power and air conditioning for a fully-loaded 512 PC cluster, but everything in the lab is less than 1/4 that load. The ductwork is designed to shoot cold air across the ceiling of the entire room. The result is that the entire lab gets rather chilly... the cardboard "deflectors" on the ductwork, and the two fans on the side, are used to redirect the airflow so that the rest of the lab stays at a more comfortable temperature. Thus, the true purpose of the fans is to keep cold air from reaching Tim Mattox's desk on the other side of the room.
Although we had a lot of student help (30-40 people total), the point of that was getting lots of people involved in the technology, not really that we needed tons of cheap labor. After all, we are a University -- we're supposed to be inspiring and teaching students. :-)
Even so, we estimate that it took less than 60 person-hours to put everything together. About 50 of those 60 hours were spent on building the PCs themselves. We could have bought the PCs pre-built at little or no extra cost, but building them ourselves allowed us to get precisely what we wanted. For example, there was only one vendor for the cases (which we love!) that had the 300W power supply, but that vendor didn't have everything else we wanted. Yeah, a 250W power supply would have worked too, and it would even have been a bit cheaper, but we got better stuff and the students got the experience of seeing firsthand how these things get put together.
That said, keep in mind that our group has built LOTS of clusters since 1994. If you are building your first, it will take you longer. There are a lot of tricks we've learned. Even stupid things like assembling the shelving units can take hours if you get the wrong kind of shelving; having wheels on the shelves also saved us more than a few hours because things didn't have to be assembled in place.
The only thing set in stone is our name.