The PAPERS Museum

This hypertext document presents a brief illustrated history of the development of PAPERS, Purdue's Adapter for Parallel Execution and Rapid Synchronization.

PAPERS0 (February 1994)

The first PAPERS prototype, this box was designed and constructed in less than two weeks... although we subsequently spent several months debugging and testing it. The unit connects to four PCs using standard printer cables to go from each PC to the corresponding Centronics connector mounted on the back of PAPERS0. Within the Oak box (with the top attached by velcro) are a power supply, rear-mounted Centronics parallel port connectors, wire-wrapped main circuit board, and wire-wrapped 40-LED display board. The LED display was a tad excessive, and was the vast majority of the power draw, but it looked cute and was helpful in debugging.

The full dynamic barrier mechanism is implemented using just one AMD 22V10 PAL per processor, with a group of TTL drivers used to ensure proper interface levels for the parallel printer ports of four PCs. Each logical barrier synchronization requires 4 port operations (cycles), organized as a barrier followed by an "anti-barrier": the barrier to synchronize, the anti-barrier to ensure that all participating processors have detected that synchronization was achieved.

The only communication supported by the hardware was a 1-bit multibroadcast intended to provide a means for voting on membership in a new barrier mask. However, we observed that arbitrary communication patterns and functions of the aggregate state also could be implemented using this hardware, and thus was born the concept of associating aggregate communications with barriers. If the data bit sent by a PE was the same as the last value sent, the transmission required 4 cycles; otherwise, a fifth cycle was needed to give the changed data bit time to settle (in essence, avoiding a race with the barrier GO signal). Unfortunately, the electrical characteristics of the parallel printer port turned out to be a lot more "interesting" than we had expected. We spent a lot of time in the MSEE 190 undergraduate laboratory debugging noise problems, which isn't too big a surprise when you combine the port characteristics with the wire-wrap jungle inside the box....

Because the parallel printer port supports generation of interrupts, we were very careful to make PAPERS0 able to generate interrupts either when a barrier synchronization was achieved or when a processor sent a "parallel interrupt request." How the interrupt would be triggered was software selectable. However, we quickly realized that latency and OS problems made it unwise for a parallel interrupt to generate a "real" interrupt on the PC, so generation of real PC interrupts was always disabled in our library.

TTL PAPERS (June 1, 1994)

Given how well PAPERS0 worked, and that we had a pretty good idea of what kind of electrical surprises to expect from the printer ports, we decided to design and build a simplified version that would use only TTL parts -- no PALs. This unit used just 8 standard TTL chips to implement a non-partitionable static barrier version of PAPERS0.

Aside from the logic simplification, the TTL PAPERS unit incorporated a few improvements. An obvious change is that the front panel has just one LED for each PE and one LED that acts as a power indicator. This brought the power consumption down to a level that allowed us to use a cheap AC wall adapter unit, a few capacitors, and a 7805 voltage regulator as the power supply. We also simplified construction, reduced signal noise, and cut cost by directly connecting cables to the circuit board rather than connecting them to a rear-mounted connector that is in turn connected to the circuit board. Because the cable mounts and power supply no longer mandated a large box, we were able to make the box much smaller, and changed to a design that used a pivoting "hood" for the cover. Another change was that all connections were soldered rather than wire-wrapped, greatly improving reliability.

For all you woodworkers out there, the TTL PAPERS box is Oak with a cover made out of Poplar.

PAPERS1 (June-August 1994)

Just as TTL PAPERS was simplifying the PAPERS0 design, PAPERS1 was attempting to create a higher-performance enhanced version. Because it was to be the high-performance version, the PAPERS1 hardware has the dubious honor of having undergone so many revisions that we lost count long ago. Phrases like "no, that PAL design is ancient... it's from last week" come to mind....

In any case (and, incidentally, the case of PAPERS1 is made of Pine), PAPERS1 does yield optimum performance. It is a full dynamic barrier mechanism with a variety of enhanced data communication operations, and all PAPERS1 operations require just 2 cycles. Inside the case....

Each PE corresponds to some TTL drivers and two AMD 22V10 PALs: one "barrier PAL" and one "communication PAL." This separation allowed PAPERS1 to perform both 1-bit multibroadcast (like PAPERS0) and a 4-bit multibroadcast that we now call "putget." (Currently, PAPERS1 has been upgraded by replacing 1-bit multibroadcast with NANDing as described below.) Making PAPERS1 capable of 2 cycle data transmission required a bit of cleverness implemented by a carefully crafted state machine. In essence, PAPERS1 internally simulates a multi-cycle data transmission using its own clock; this can yield near peak performance, but also makes it necessary to tune the internal clocking to the PC port and cable characteristics.

The board is wire-wrapped, but, unlike PAPERS0, wires were routed very carefully to minimize interference. Having 8 PALs meant a 300 mA AC adapter wouldn't suffice, so there is a small switching power supply inside and an on/off switch on the back. The method for connecting cables and the LED display resemble those of the first TTL PAPERS.

It is also worthwhile to note that PAPERS1 upgraded the PAPERS0 concept of a parallel interrupt to include a special "interrupt acknowledge" barrier. Although the library still doesn't generally use this mechanism, it is used to achieve the initial synchronization when a parallel code begins execution. There is also a hardware ID feature that allows a PE to determine which PAPERS PE it is connected as -- earlier versions of PAPERS had to be explicitly given their PE number.

TTL PAPERS (July 1, 1994)

Just when we thought that things couldn't be done any simpler than the first TTL PAPERS, we realized a few things:

A simple 4-bit NAND across the processors can perform 1-bit multibroadcast, and it also can broadcast, global OR, and global AND. This also leads to a hardware design in which all PEs have identical connections, so that PE numbering is simply a matter of convention.
By using two strobe signals to request a barrier synchronization, static partitioning can be supported with minimal hardware.
Why have interrupt logic if you don't use it?
Although the simpler designs could not duplicate the 2 cycle data communication performance of PAPERS1, no extra hardware is needed to use a toggling ready signal to achieve 2 cycle barrier synchronization.
The single LED for each PE was not quite enough status information, but a bi-color LED for each PE would add very little complexity. We also made the power LED blue to avoid confusion with the PE status LEDs. (Ok, we really did it because blue LEDs are cool, neato, spiffy, keen... but don't tell anybody. ;-)

All of this led to a second TTL PAPERS that was built using only 5 standard TTL chips. Well, it was really 6 chips if you count the additional TTL driver chip we added later to brighten the LEDs. By the way, the case is a slightly rounded version of the earlier TTL PAPERS case, but made with Aspen instead of Oak.
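
To see why a single NAND is enough, here is a minimal software sketch of the first point in the list above. It is not PAPERS library code: it assumes each PE drives a 4-bit value and reads back the bitwise NAND of all the values, and the function names (nand4, multibroadcast, broadcast, global_or, global_and) are made up for illustration.

```python
from functools import reduce

NYBBLE = 0xF  # assumption: each PE drives 4 data bits


def nand4(sent):
    """Bitwise NAND of the 4-bit values driven by all PEs."""
    return ~reduce(lambda a, b: a & b, sent, NYBBLE) & NYBBLE


def multibroadcast(bits):
    """1-bit multibroadcast: PE k owns bit position k (up to 4 PEs).

    PE k drives the complement of its data bit in position k and 1s
    elsewhere, so the NAND result holds every PE's bit.
    """
    sent = [NYBBLE & ~(b << k) for k, b in enumerate(bits)]
    result = nand4(sent)
    return [(result >> k) & 1 for k in range(len(bits))]


def broadcast(sender, value, n_pes):
    """One PE drives the complement of its nybble; the others drive all 1s."""
    sent = [(~value & NYBBLE) if pe == sender else NYBBLE
            for pe in range(n_pes)]
    return nand4(sent)


def global_or(bits):
    """Every PE drives the complement of its bit in position 0."""
    return nand4([0xE | (b ^ 1) for b in bits]) & 1


def global_and(bits):
    """Every PE drives its bit in position 0; software inverts the NAND."""
    return (nand4([0xE | b for b in bits]) & 1) ^ 1
```

The trick is the same in every case: a PE with nothing to say in a bit position drives a 1 there, so the AND inside the NAND lets the other PEs' bits through unchanged.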

Something else wonderful happened with this prototype: we finally got a place where we could keep a PAPERS unit connected long term. Up to this time, we had been borrowing a few of the 486DX2/66 machines in the MSEE 190 undergraduate laboratory, but we could only use those machines when classes didn't need them. The cluster of 486DX33 machines shown with the second TTL PAPERS box finally gave us a place to experiment without having to compete with undergraduate students for access to the machines.

8 PE PAPERS (August 1994)

All the above is good stuff, but does it scale? Although this Oak and Aspen box is physically even smaller than the first two 4 PE TTL PAPERS boxes, it was put together for the sole purpose of demonstrating that the design scales at least to an 8 PE cluster. We were not very particular about how it would prove scalability, so the functionality is essentially a supercharged version of what TTL PAPERS supports, with the state machine timing properties of PAPERS1. Thus, it provides static barrier synchronization with 4-bit NANDing, all done with 2 cycle speed. Inside the box....

The single wire-wrapped circuit board is very densely packed with a variety of TTL and AMD PAL parts implementing a non-partitionable static barrier mechanism. Unfortunately, the wire-wrap was apparently a bit too dense, because we measured something on the order of a 2 volt spike on one wire that wasn't supposed to be doing anything at the time... a few capacitors cleared up these little problems, but left us cursing wire-wrap. We also used a different method to connect the cables to the board: we made a little DIP header for each group of similar signals from the PCs. We will not do that again either. Oh yeah. It also was the first PAPERS unit to need a heat sink on its 7805 and a fan.

Ok, so this one was an evolutionary dead end. Anyway, it works and it is fun to watch the 8 bi-color LEDs as it plays one of our MIMD multi-voice music demos.

16 PE PAPERS (August-November 1994)

Don't ask. We tried to quickly build a 16 PE static barrier 2 cycle unit using a TI FPGA and some simple 4 PE signal-conditioning boards (shown in the photo above). Results: (1) we proved that many Purdue EE students don't know how to solder and (2) doing something like this in a rush virtually guarantees failure. It was a good experience in learning how to use Mentor Graphics and the FPGAs, making our own printed circuit boards, etc. Although we completed several of the 4 PE signal-conditioning boards, we'll probably never bother finishing this prototype -- the design is now obsolete.

TTL_PAPERS (November 1994)

Ok, you can ask about this one. Actually, that's why we built it (and because the 16 PE unit wasn't going to be ready in time for our booth at Supercomputing ;-). This is the unit we've been building in significant quantity and supporting as a full public domain hardware design and support software release.

Basically, this version of PAPERS is just like the second TTL PAPERS unit, except:

The box is smaller and the cover is simply a tight fit -- there isn't any pivoting hood nor even any velcro.
The circuit board is a single-sided custom board rather than a prototyping board. Further, the circuit board allows the LEDs to be directly mounted on the "wrong side" so that they poke through holes in the front panel; earlier PAPERS units required careful hand-fitting of the LEDs into their mounting holes.
TTL_PAPERS implements a special interrupt barrier mechanism similar to that of PAPERS1. This adds 2 TTL chips for selecting between the normal and interrupt ready signals, bringing the full count to 8 TTL chips.

Although the particular box pictured above is made out of Aspen, other copies have front and rear panels made out of Oak, Cherry, Walnut, Poplar, Pine, and Mahogany. Incidentally, although the PE numbering is arbitrary, the unit shown in the above photo has the display numbered with higher PE numbers corresponding to lower positions on the panel, which is the reverse of our "standard" numbering.

We are using this version of TTL_PAPERS for our permanent clusters. For example, the following photo shows our first Pentium cluster. These machines were donated in Summer 1995 by Intel specifically for the PAPERS project. Each holds a Pentium 90, 32M RAM, and 700M disk.

Since other people at other places also have been building this type of TTL_PAPERS unit (e.g., Prof. Will Cohen has built one at the University of Alabama at Huntsville), it is useful to take a closer look at some of the construction details for TTL_PAPERS. The back of the box is....

From this view, you can see the power and cable connections on the rear panel. Opening the box....

This photograph reveals the construction of the box itself as well as the installation of the circuit board and cable connections. Notice that all the signal ground connections are made by mechanically connecting and soldering directly to a common ground post in the base of the unit -- this provides both a better electrical ground and a solid physical connection to help ensure that the cables will not pull out (this type of ground connection was used on all but PAPERS0 and PAPERS1). Partially removing the board....

If you back the board away from the front panel, you can see how the LEDs are mounted on the "wrong side" of the board and how they fit into the front panel mounting holes. In fact, this fit is generally tight enough that no separate mounting hardware is needed to attach the board to the box.

TTL_PAPERS 950801 (August 1995)

Although the TTL_PAPERS design of November 1994 was widely accepted, and a few other universities have built clusters using that design, we still have trouble getting some people to take it seriously because it only connects 4 machines. True, we did detail how to scale to larger systems, but that left a lot of people unconvinced. There was also the problem that scaling to a larger cluster meant building a whole new PAPERS unit... you can't incrementally expand the unit. In contrast, TTL_PAPERS 950801 is an 8-processor unit that modularly scales to thousands of processors....

The practical maximum number of machines that can be placed in a single rack is 8; thus, larger systems would most naturally be composed of multiple 8-machine racks. Ideally, the cluster should be able to be constructed by simply connecting the PAPERS modules housed within each rack of 8 machines. Further, to minimize wiring distances within each rack, the PAPERS module should really be placed in the middle of the rack (rather than being a stand-alone box). TTL_PAPERS 950801, which is designed to be built into a slide-out drawer within a wooden 8-machine rack, meets these goals by implementing a modular version of the TTL_PAPERS design of November 1994.

As our first attempt at a modular design, there have been quite a few new problems to be solved. Perhaps the most difficult question is: what interconnect pattern should be used to link the units of multiple 8-machine racks? The answer we have implemented is that TTL_PAPERS 950801 units can be linked in a tree structure with a fan-out of five and an increase in operation time of <200ns for each level in the tree. Thus, a two-level tree allows up to 8 + 5*8, or 48, machines; a three-level tree allows up to 8 + 5*8 + 5*5*8, or 248, machines. A four-level cluster could use as many as 8 + 5*8 + 5*5*8 + 5*5*5*8, or 1248, machines while adding only about 4 * 200ns, or 0.8 microseconds, to the time for each basic operation.
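
The capacity arithmetic above follows a simple geometric pattern, which a few lines of Python can check. This is only a sketch: the function names are hypothetical, and the delay figure uses 200ns per level as an upper bound, since the text says each level adds less than that.

```python
def max_machines(levels, fan_out=5, per_unit=8):
    """Largest cluster for a tree with the given number of levels.

    Level k (counting the root as level 0) holds fan_out**k units,
    each connecting per_unit machines.
    """
    return sum(per_unit * fan_out**k for k in range(levels))


def added_delay_ns(levels, per_level_ns=200):
    """Upper bound on the extra time per basic operation, in nanoseconds."""
    return levels * per_level_ns


for levels in (2, 3, 4):
    print(levels, max_machines(levels), added_delay_ns(levels))
# 2 48 400
# 3 248 600
# 4 1248 800
```

Passing fan_out=4 gives the corresponding limits for a design with a smaller fan-out.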

This tree-structured expandability unfortunately implies that there are actually four distinct configurations of TTL_PAPERS 950801 boards: stand alone, root node, internal node, and leaf node. Although the same board layout works for all, there are significant population and wiring differences between these configurations. Further, we did not have space for enough drivers for the internal node configuration, so separate driver boards are needed for very large clusters.

TTL_PAPERS 951201 (November 1995)

The TTL_PAPERS unit we released at the IEEE/ACM Supercomputing conference in November 1994 was a sanitized and improved version of the earlier 4-processor TTL_PAPERS units. Likewise, the TTL_PAPERS 951201 which we will be releasing at Supercomputing 1995 is an improved version of the modularly scalable 8-processor TTL_PAPERS 950801.

None of the changes from TTL_PAPERS 950801 is particularly dramatic, but there are dozens of incremental improvements. Many of these improvements relate to the electrical and mechanical properties of the board, but a few extensions have been made to the functionality. Although TTL_PAPERS 951201 only supports a tree fan-out of four at each level (versus five for the 950801 design), there is no need for additional drivers and all four board configurations are supported without major wiring changes. There is also a new interface on the board that facilitates connection of external logic to support real-time control applications (e.g., an external timer or sensor can trigger a barrier).

When will more details on the TTL_PAPERS 951201 design be publicly available? At and after our Supercomputing 1995 research exhibit, December 5-7, 1995.... Until then, here's a peek inside:

CAPERS (March 1996)

Hey... isn't that just a photo of a cable? Yup. To be precise, it is what is sometimes called a LapLink cable, but we call it CAPERS: the Cable Adapter for Parallel Execution and Rapid Synchronization. While we were busy improving the TTL_PAPERS library, we realized that it would be possible to implement a version for two processors using only a passive parallel cable connection between the machines. This doesn't scale, but it sure makes it easy and cheap to try out the library. Aside from the inability to scale, performance is slightly worse than using the TTL_PAPERS hardware.

PAPERS_JR 960801 (August 1996)

After we had completed the TTL_PAPERS 960801 board layout, we found that we had a little corner of the layout rectangle unused (because the 960801 layout actually has two boards: a main board and a boardlet used to support scaling). After getting price quotes on the two-layer plated-through board, we discovered that the price would be unaffected by whatever we decided to put in that corner. So we couldn't leave it blank. ;-)

One idea was to make a PAPERS keychain. Another passing whim was to make it an image of the group's business card. However, we finally figured out something useful to do with this space: PAPERS_JR.

PAPERS_JR is essentially a CAPERS unit, but it adds two features:

It has the now-standard-for-us LED display. Each processor controls a bi-color LED and there is a blue power LED.
While CAPERS is not capable of supporting OS-level parallel interrupts, PAPERS_JR can work with a gang-scheduling meta-OS.

Neither of the above TTL_PAPERS-like features is implemented the same way as in TTL_PAPERS; rather, they are implemented in a way that allows the PAPERS_JR library to work with CAPERS hardware for everything that CAPERS supports.

Of course, PAPERS_JR is both very small and very simple: just two TTL parts. The wooden case for PAPERS_JR (mahogany, in the photo ;-) is exceptionally simple to make, and looks a lot like a solid block of wood with a hole in it for cooling the 7805 voltage regulator. It makes a great stocking-stuffer. ;-)

TTL_PAPERS 960801 (August 1996)

Although TTL_PAPERS 951201 makes a nice, modularly scalable, eight-processor TTL_PAPERS, there are a few annoying things about that design which we have fixed in the TTL_PAPERS 960801 design. Of course, everything one fixes breaks something else.... So, here's what's new:

Eight machines typically fit in a rack, but a four-machine module size seems much more convenient for handling and is less susceptible to catastrophic problems from single-machine failures. So, 960801 is a four-processor module. The bad news is that connecting eight machines takes three units (each module has only one scaling connection); the good news is that connecting 16 machines takes only five, etc.
Unlike the previous modular scalable designs, TTL_PAPERS 960801 is easily scalable in the field. Basically, the top node in a tree, or a stand-alone four-processor unit, takes a boardlet that plugs into the DB25 scaling connector on the module board.
There is a lot to be said in favor of single-sided circuit boards, but soldering the wires of each cable individually gets old pretty fast. Further, as a practical matter, a large-scale system is much more likely to encounter cable troubles that would require one or more cables to be replaced... an insurmountable task with the 951201 design. Thus, TTL_PAPERS 960801 uses DB25 connectors for all cable connections. The primary disadvantages are that the resulting board is two-sided with hundreds of plated-through holes (kids: don't try to make this at home), it is also several times larger than the other four-processor TTL_PAPERS units, and cable and connector expenses are non-trivial.
Although a topic of heated debate within our group, additional circuitry (mostly pull-up resistors) was added to allow use of HCT CMOS parts instead of LS TTL. The CMOS parts offer sharper, cleaner logic levels... but that also implies that they cause more switching noise. In any case, you can use either family (or even mix them on a single board). These pull-ups also mean that, unlike all other PAPERS units, you'll see LEDs light when the PAPERS unit is connected but not powered. Largely because of the number of driver chips, this board draws more power (the supply should provide around 500 mA at 9 V). That is enough power to make it a little hard to find an inexpensive wall-mounted AC-to-DC adapter. Powering multiple boards from a single OEM-style power supply is thus more appropriate.
The board is designed to slide into the case (which in the photo has a walnut front panel ;-). This avoids fancy mounting hardware and makes it very easy to change cable connections.
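
The unit counts quoted in the list above (three units for eight machines, five for 16) are consistent with a simple counting rule, sketched below in Python. The rule is an assumption rather than something stated in the design: every machine plugs into a leaf module, and each group of up to four scaling connectors plugs into the four ports of one more module at the next level up. The function name modules_needed is hypothetical.

```python
def modules_needed(machines, ports=4):
    """Count four-port modules for a cluster, under the assumed rule.

    Leaf modules connect machines; each higher-level module combines
    up to `ports` scaling connectors from the level below.
    """
    if machines < 1:
        raise ValueError("need at least one machine")
    level = -(-machines // ports)  # ceiling division: leaf modules
    total = level
    while level > 1:
        level = -(-level // ports)  # modules needed to combine this level
        total += level
    return total


print(modules_needed(4))   # 1
print(modules_needed(8))   # 3
print(modules_needed(16))  # 5
```

Under this rule the overhead of combining modules shrinks toward one extra module per three leaf modules as the cluster grows, which fits the remark that the design pays off mainly for larger systems.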

In summary, this four-processor design is not really worthwhile if all you want to connect is four or eight machines. However, if you expect to have a lot more than four, and/or to incrementally add more machines, this is the design you want.

Thank you for visiting the PAPERS museum. As further developments are made (or whenever we get around to it), new exhibits will be added. If you have any suggestions or comments, please send them to hankd@engr.uky.edu.

The only thing set in stone is our name.