Spring 2015 EE599-001 Cluster/Multi-Core Computing
Spring 2015 EE699-001 Cluster/Multi-Core Computing
Cluster/Multi-Core Computing is about the many
flavors of MIMD-parallel computing architectures that now
dominate high-performance computing. It used to be that 4
processors were taken seriously as a parallel supercomputer.
Now, there are cell phones with 4 cores. Sequoia, the IBM
BlueGene/Q system that was at the top of the Top500 supercomputers list in June
2012, was the first machine listed with over a million cores: it
has 1,572,864 cores! Tianhe-2 (MilkyWay-2), the top machine as
of November 2014, has 3,120,000 cores.
I'm going to run this course backwards from how I used to.
Instead of starting with architecture, we're going to start with
user-level programming and do the architecture stuff while
you're working on projects. Why? Unlike GPUs and most other
parallel computers, MIMDs sort-of just work and code is actually
portable, which certainly wasn't the case two decades ago.
We'll still cover architecture issues and performance tuning,
but towards the end of the course.
UK closed February 17 and 19, and March 5, due to
weather. The classes missed on those days and on February 10 will be made up
later in the semester by extending the remaining Tuesday classes, starting on
March 31, to begin at 8:15 AM and meet in 108 Marksbury.
Course Materials
All course materials will be linked here.
I said will be... as in more is being added.
Shared Nothing
-
Here is the introductory overview of MPI that I gave to the KAOS
group last year (PDF); a minimal MPI example sketch also appears after this list.
-
Here is the UKAN reference card, the back of which is a nice
one-page reference for the most common MPI usage
(PDF)
-
One of the better outlines of MPI (listings with overviews of the
arguments to each function) is
this tutorial
at Lawrence Livermore Labs
-
A nice overview of problems with MPI RMAs, which are inspiring
the MPI-3 changes, is
this document.
-
A really detailed set of slides on MPI IO is this
PDF.
A shorter overview is in
this document.
-
Here (PS or
PDF) is the paper I mentioned before about using barriers to
maintain permutation communication structure on the CM5.
(Note: the principle of using barriers to synchronize
communications was used well before this paper, for example,
in the PASM prototype as early as 1987, but the barrier and communication
mechanisms were quite different from those in the CM5.)
-
Perhaps the most wacky stuff involves executing MIMD code on SIMD hardware.
I've done a lot of this; see here
and here.
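To make the MPI references above concrete, here is a rough sketch (not one of the posted course codes) of the usual Pi computation done with MPI: each rank sums a strided share of the intervals and MPI_Reduce combines the partial sums on rank 0. The interval count and the suggested compile/run commands are assumptions for illustration only.

/* Rough MPI sketch: partial Pi sums combined with MPI_Reduce.
   Compile with something like: mpicc mpisketch.c -o mpisketch
   Run with something like:     mpirun -np 4 ./mpisketch       */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nproc, i;
    int n = 1000000;                   /* number of intervals (assumed) */
    double h, sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    h = 1.0 / (double) n;
    for (i = rank; i < n; i += nproc) { /* interleave intervals across ranks */
        double x = h * ((double) i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    sum *= h;

    /* combine the partial sums on rank 0 */
    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}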
Shared Memory
-
Here is a nice overview intro to OpenMP/OMP as slides (PDF). OMP pragmas are understood by recent GCC releases
(GOMP is built-in), but must be enabled by giving
-fopenmp on the gcc command line with no other special
options; my Pi computation example for OMP is mppi.c. Normally, environment
variables such as OMP_NUM_THREADS are used to control things like how many
threads to create; see also the sketch after this list
-
POSIX Threads (pthreads) is now a standard library included in
most C/C++ compilation environments, and linked as the -lpthread
library under Linux GCC; my Pi computation example for pthreads is pthreadspi.c
(a rough pthreads sketch also appears after this list)
-
My System V shared memory version of the Pi computation is
shmpi.c -- note that this
version uses raw assembly code to implement a lock, which
has far less overhead than using the System V OS calls
(unless you're counting on the OS to schedule based on who's
waiting for what); a lock sketch using GCC's atomic builtins appears after this list
-
UPC (Unified Parallel C) is an
extension of the C language, and hence requires a special
compiler. There are several UPC compilers; the fork of GCC
called GUPC must be installed as described at the project
homepage (on my systems, it is installed at
/usr/local/gupc/bin/gupc). My Pi computation example
for UPC is upcpi.upc; compilation
is straightforward, but the executable produced processes some
command line arguments as UPC controls, for example, -n
is used to specify the number of processes to create.
-
Here's "Shared Memory Consistency Models: A Tutorial"
(PDF) -- Sarita Adve has done quite a few versions of this
sort of description
-
AMD64 atomic instructions are listed
here
-
Here's "An Overview of the NYU Ultracomputer Project (1986)"
(PDF) which discusses how it implemented fetch-and-op
-
Many short, yet still confusing, descriptions of Futexes are
available and here's probably the best early overview (PDF); the
catch is that various Linux kernels have different
futex() implementations with 4, 5, or 6 arguments
-
Here's "Treadmarks: Shared Memory Computing on Networks of Workstations,"
(PDF) -- a nice description of how they
implement distributed shared memory
-
Transactional Memory has been a hot idea for quite a while;
now, Intel's Haswell processors have a hardware implementation.
It is described in chapter 8 of this
PDF (locally, PDF);
there is also a transactional memory extension proposal for C++
(PDF).
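As mentioned in the OpenMP item above, here is a rough sketch (not the posted mppi.c) of the same Pi computation as an OpenMP parallel-for with a reduction; the interval count is an arbitrary assumption.

/* Rough OpenMP sketch: Pi as a parallel-for reduction.
   Compile with: gcc -fopenmp omppisketch.c
   Control the thread count with, e.g.: OMP_NUM_THREADS=4 ./a.out */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, n = 1000000;                /* number of intervals (assumed) */
    double h = 1.0 / (double) n, sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i) {
        double x = h * ((double) i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi is approximately %.16f using up to %d threads\n",
           sum * h, omp_get_max_threads());
    return 0;
}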
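Likewise, here is a rough pthreads sketch (not the posted pthreadspi.c); the thread count, interval count, and the helper name work() are made up for illustration.

/* Rough pthreads sketch: each thread sums a strided share of the Pi
   intervals into its own slot, and main() adds the partial sums.
   Compile with something like: gcc ptsketch.c -lpthread            */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4                     /* thread count (assumed) */
#define N        1000000               /* number of intervals (assumed) */

static double partial[NTHREADS];

static void *work(void *arg)
{
    long me = (long) arg;
    double h = 1.0 / (double) N, sum = 0.0;
    int i;

    for (i = (int) me; i < N; i += NTHREADS) { /* interleaved intervals */
        double x = h * ((double) i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    partial[me] = sum * h;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    double pi = 0.0;
    long i;

    for (i = 0; i < NTHREADS; ++i)
        pthread_create(&t[i], NULL, work, (void *) i);
    for (i = 0; i < NTHREADS; ++i) {
        pthread_join(t[i], NULL);
        pi += partial[i];
    }
    printf("pi is approximately %.16f\n", pi);
    return 0;
}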
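And, as noted in the System V shared memory item, here is a sketch of a simple spinlock built from GCC's __sync atomic builtins rather than raw assembly; the names lock, lock_acquire, and lock_release are made up, and this only illustrates the technique, not the actual code used in shmpi.c.

/* Rough spinlock sketch using GCC atomic builtins; a shared int placed
   in System V shared memory would work the same way.                  */
#include <stdio.h>

static volatile int lock = 0;          /* 0 = free, 1 = held */

static void lock_acquire(volatile int *l)
{
    /* spin until the atomic exchange returns 0 (the lock was free) */
    while (__sync_lock_test_and_set(l, 1)) {
        while (*l) ;                   /* spin on plain reads to cut bus traffic */
    }
}

static void lock_release(volatile int *l)
{
    __sync_lock_release(l);            /* store 0 with release semantics */
}

int main(void)
{
    lock_acquire(&lock);
    printf("in the critical section\n");
    lock_release(&lock);
    return 0;
}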
Project Stuff
One of the nicest things about MIMDs is that they behave a lot like
lots of processes running timeshared -- so you can develop and test
just about everything in this course even on a single-core system.
However, you'll be running code on at least one shared-memory
multi-core system and you'll be running on at least one cluster.
Those systems are housed in 108A Marksbury, but will be accessible
via remote login (ssh). Those systems all run Linux, and everything
submitted must work in that environment, although you can develop
things under Windows or OS X.
-
Thanks to Frank Roberts, a list of what hardware is available
for students in this course to use is here.
-
Projects will be submitted via WWW forms and a CGI script.
Register with the system here
before you can submit any projects.
-
Assignment 0 was due February 5, 2015.
(The submission server died, so the deadline was pushed back slightly.)
-
Assignment 1 was due February 26, 2015.
However, it will be accepted without penalty through March 5.
-
Assignment 2 is due April 9, 2015.
There is now a test-case generator program called
linemaker.c. Compile it with
cc linemaker.c -o linemaker and run it with
two arguments: the seed and the number of lines to generate.
A negative seed generates an optimal solution, while a
positive seed generates a randomized line order.
You do not need to understand the details of GAs for this project,
but here is DUMEC.tgz, the collection of
sequential search codes I used to introduce GAs in class.
-
Assignment 3 is due April 23, 2015.
This is a really simple assignment using OpenMP.
Don't try to make it harder than it is.
-
Assignment 4 is due May 7, 2015.
This is a really simple assignment using both MPI and OpenMP.
(The full assignment isn't quite posted yet.)
Course Staff
Professor Hank Dietz is usually in his office,
203 Davis Marksbury Building,
and has an "open-door" policy that whenever his door is open and
he's not busy with someone else, he's available. However, there
are quite a few other places he might be around campus.
His schedule is at http://aggregate.org/hankd/cal.html, which also shows
live sensor info about his availability.
Alternatively, you can also email hankd@engr.uky.edu to make an appointment.
About The Graphic
About the graphic: It's NAK, a 64-node cluster with a GPU in
every node, but dressed-up as the machine that runs everything
in the movie Metropolis. There's a lot to program in there, and
the huge set of independent workers is a pretty good model for
how that's done. Fortunately, the machine here also carries a
useful message written in friendly big red letters.