Spring 2015 EE599-001 Cluster/Multi-Core Computing
Spring 2015 EE699-001 Cluster/Multi-Core Computing
Cluster/Multi-Core Computing is about the many
flavors of MIMD-parallel computing architectures that now
dominate high-performance computing. It used to be that 4
processors were taken seriously as a parallel supercomputer.
Now, there are cell phones with 4 cores. Sequoia, the IBM
BlueGene/Q system that was at the top of the Top500 supercomputers list in June
2012, was the first machine listed with over a million cores: it
has 1,572,864 cores! Tianhe-2 (MilkyWay-2), the top machine as
of November 2014, has 3,120,000 cores.
I'm going to run this course backwards from how I used to.
Instead of starting with architecture, we're going to start with
user-level programming and do the architecture stuff while
you're working on projects. Why? Unlike GPUs and most other
parallel computers, MIMDs sort-of just work and code is actually
portable, which certainly wasn't the case two decades ago.
We'll still cover architecture issues and performance tuning,
but towards the end of the course.
UK closed February 17 and 19, and March 5, due to
weather. The classes missed on those days and on February 10 will be made up
later in the semester by extending the remaining Tuesday classes, starting on
March 31, to begin at 8:15 AM and meet in 108 Marksbury.
Course Materials
All course materials will be linked here.
I said will be... as in more is being added.
Shared Nothing
-
Here is the introductory overview of MPI that I gave to the KAOS
group last year (PDF); a minimal MPI example sketch also appears after this list.
-
Here is the UKAN reference card, the back of which is a nice
one-page reference for the most common MPI usage
(PDF)
-
One of the better outlines of MPI (listings with overviews of the
arguments to each function) is
this tutorial
at Lawrence Livermore Labs
-
A nice overview of problems with MPI RMAs, which are inspiring
the MPI-3 changes, is
this document.
-
A really detailed set of slides on MPI IO is this
PDF.
A shorter overview is in
this document.
-
Here (PS or
PDF) is the paper I mentioned before about using barriers to
maintain permutation communication structure on the CM5.
(Note: the principle of using barriers to synchronize
communications was used well before this paper, for example,
in the PASM prototype as early as 1987, but the barrier and communication
mechanisms were quite different from those in the CM5.)
-
Perhaps the most wacky stuff involves executing MIMD code on SIMD hardware.
I've done a lot of this; see here
and here.
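To make the MPI references above concrete, here is a rough sketch (not one of the posted course codes) of the usual Pi computation done with MPI: each rank sums a strided share of the intervals and MPI_Reduce combines the partial sums on rank 0. The interval count and the suggested compile/run commands are assumptions for illustration only.

/* Rough MPI sketch: partial Pi sums combined with MPI_Reduce.
   Compile with something like: mpicc mpisketch.c -o mpisketch
   Run with something like:     mpirun -np 4 ./mpisketch       */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nproc, i;
    int n = 1000000;                   /* number of intervals (assumed) */
    double h, sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    h = 1.0 / (double) n;
    for (i = rank; i < n; i += nproc) { /* interleave intervals across ranks */
        double x = h * ((double) i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    sum *= h;

    /* combine the partial sums on rank 0 */
    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}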
Shared Memory
-
Here is a nice overview intro to OpenMP/OMP as slides (PDF). OMP pragmas are understood by recent GCC releases
(GOMP is built-in), but must be enabled by giving
-fopenmp on the gcc command line with no other special
options; my Pi computation example for OMP is mppi.c. Normally, environment
variables such as OMP_NUM_THREADS are used to control things like how many
threads to create; see also the sketch after this list
-
POSIX Threads (pthreads) is now a standard library included in
most C/C++ compilation environments, and linked as the -lpthread
library under Linux GCC; my Pi computation example for pthreads is pthreadspi.c
(a rough pthreads sketch also appears after this list)
-
My System V shared memory version of the Pi computation is
shmpi.c -- note that this
version uses raw assembly code to implement a lock, which
has far less overhead than using the System V OS calls
(unless you're counting on the OS to schedule based on who's
waiting for what); a lock sketch using GCC's atomic builtins appears after this list
-
UPC (Unified Parallel C) is an
extension of the C language, and hence requires a special
compiler. There are several UPC compilers; the fork of GCC
called GUPC must be installed as described at the project
homepage (on my systems, it is installed at
/usr/local/gupc/bin/gupc). My Pi computation example
for UPC is upcpi.upc; compilation
is straightforward, but the executable produced processes some
command line arguments as UPC controls, for example, -n
is used to specify the number of processes to create.
-
Here's "Shared Memory Consistency Models: A Tutorial"
(PDF) -- Sarita Adve has done quite a few versions of this
sort of description
-
AMD64 atomic instructions are listed
here
-
Here's "An Overview of the NYU Ultracomputer Project (1986)"
(PDF) which discusses how it implemented fetch-and-op
-
Many short, yet still confusing, descriptions of Futexes are
available and here's probably the best early overview (PDF); the
catch is that various Linux kernels have different
futex() implementations with 4, 5, or 6 arguments
-
Here's "Treadmarks: Shared Memory Computing on Networks of Workstations,"
(PDF) -- a nice description of how they
implement distributed shared memory
-
Transactional Memory has been a hot idea for quite a while;
now, Intel's Haswell processors have a hardware implementation.
It is described in chapter 8 of this
PDF (locally, PDF);
there is also a transactional memory extension proposal for C++
(PDF).
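As mentioned in the OpenMP item above, here is a rough sketch (not the posted mppi.c) of the same Pi computation as an OpenMP parallel-for with a reduction; the interval count is an arbitrary assumption.

/* Rough OpenMP sketch: Pi as a parallel-for reduction.
   Compile with: gcc -fopenmp omppisketch.c
   Control the thread count with, e.g.: OMP_NUM_THREADS=4 ./a.out */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, n = 1000000;                /* number of intervals (assumed) */
    double h = 1.0 / (double) n, sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i) {
        double x = h * ((double) i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi is approximately %.16f using up to %d threads\n",
           sum * h, omp_get_max_threads());
    return 0;
}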
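Likewise, here is a rough pthreads sketch (not the posted pthreadspi.c); the thread count, interval count, and the helper name work() are made up for illustration.

/* Rough pthreads sketch: each thread sums a strided share of the Pi
   intervals into its own slot, and main() adds the partial sums.
   Compile with something like: gcc ptsketch.c -lpthread            */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4                     /* thread count (assumed) */
#define N        1000000               /* number of intervals (assumed) */

static double partial[NTHREADS];

static void *work(void *arg)
{
    long me = (long) arg;
    double h = 1.0 / (double) N, sum = 0.0;
    int i;

    for (i = (int) me; i < N; i += NTHREADS) { /* interleaved intervals */
        double x = h * ((double) i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    partial[me] = sum * h;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    double pi = 0.0;
    long i;

    for (i = 0; i < NTHREADS; ++i)
        pthread_create(&t[i], NULL, work, (void *) i);
    for (i = 0; i < NTHREADS; ++i) {
        pthread_join(t[i], NULL);
        pi += partial[i];
    }
    printf("pi is approximately %.16f\n", pi);
    return 0;
}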
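And, as noted in the System V shared memory item, here is a sketch of a simple spinlock built from GCC's __sync atomic builtins rather than raw assembly; the names lock, lock_acquire, and lock_release are made up, and this only illustrates the technique, not the actual code used in shmpi.c.

/* Rough spinlock sketch using GCC atomic builtins; a shared int placed
   in System V shared memory would work the same way.                  */
#include <stdio.h>

static volatile int lock = 0;          /* 0 = free, 1 = held */

static void lock_acquire(volatile int *l)
{
    /* spin until the atomic exchange returns 0 (the lock was free) */
    while (__sync_lock_test_and_set(l, 1)) {
        while (*l) ;                   /* spin on plain reads to cut bus traffic */
    }
}

static void lock_release(volatile int *l)
{
    __sync_lock_release(l);            /* store 0 with release semantics */
}

int main(void)
{
    lock_acquire(&lock);
    printf("in the critical section\n");
    lock_release(&lock);
    return 0;
}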
Project Stuff
One of the nicest things about MIMDs is that they behave a lot like
lots of processes running timeshared -- so you can develop and test
just about everything in this course even on a single-core system.
However, you'll be running code on at least one shared-memory
multi-core system and you'll be running on at least one cluster.
Those systems are housed in 108A Marksbury, but will be accessible
via remote login (ssh). Those systems all run Linux, and everything
submitted must work in that environment, although you can develop
things under Windows or OS X.
-
Thanks to Frank Roberts, a list of what hardware is available
for students in this course to use is here.
-
Projects will be submitted via WWW forms and a CGI script.
Register with the system here
before you can submit any projects.
-
Assignment 0 was due February 5, 2015.
(The submission server died, so the deadline was pushed back slightly.)
-
Assignment 1 was due February 26, 2015.
However, it will be accepted without penalty through March 5.
-
Assignment 2 is due April 9, 2015.
There is now a test-case generator program called
linemaker.c. Compile it with
cc linemaker.c -o linemaker and run it with
two arguments: the seed and the number of lines to generate.
A negative seed generates an optimal solution, while a
positive seed generates a randomized line order.
You do not need to understand the details of GAs for this project,
but here is DUMEC.tgz, the collection of
sequential search codes I used to introduce GAs in class.
-
Assignment 3 is due April 23, 2015.
This is a really simple assignment using OpenMP.
Don't try to make it harder than it is.
-
Assignment 4 is due May 7, 2015.
This is a really simple assignment using both MPI and OpenMP.
(The full assignment isn't quite posted yet.)
Course Staff
Professor Hank Dietz is usually in his office,
203 Davis Marksbury Building,
and has an "open-door" policy that whenever his door is open and
he's not busy with someone else, he's available. However, there
are quite a few other places he might be around campus.
His schedule is at http://aggregate.org/hankd/cal.html, which also shows
live sensor info about his availability.
Alternatively, you can also email hankd@engr.uky.edu to make an appointment.
About The Graphic
About the graphic: It's NAK, a 64-node cluster with a GPU in
every node, but dressed-up as the machine that runs everything
in the movie Metropolis. There's a lot to program in there, and
the huge set of independent workers is a pretty good model for
how that's done. Fortunately, the machine here also carries a
useful message written in friendly big red letters.