References: EE380 Performance Analysis
This material is largely covered at the end of the first
chapter of the textbook.
Make sure you are comfortable with the tabular breakdown of
expected instruction execution counts, CPIs, and clock period.
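That tabular calculation can be sketched in a few lines of code; the instruction classes, frequencies, and per-class CPIs below are hypothetical numbers for illustration, not values from the text:

```python
# Hypothetical instruction mix (not from the text): each entry is
# (instruction class, fraction of dynamic instruction count, CPI).
mix = [
    ("ALU",    0.50, 1),
    ("load",   0.20, 2),
    ("store",  0.10, 2),
    ("branch", 0.20, 3),
]

clock_period = 0.5e-9        # 0.5 ns, i.e., a 2 GHz clock
instruction_count = 1_000_000

# Average CPI is the execution-frequency-weighted sum of per-class CPIs.
avg_cpi = sum(frac * cpi for _, frac, cpi in mix)

# Classic performance equation:
# CPU time = instruction count * average CPI * clock period.
cpu_time = instruction_count * avg_cpi * clock_period

print(f"average CPI = {avg_cpi:.2f}")            # 1.70
print(f"CPU time    = {cpu_time * 1e3:.3f} ms")  # 0.850 ms
```

Comparing two designs is then just a matter of recomputing avg_cpi (and clock_period) for each and comparing the resulting CPU times.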
Other stuff:
-
The most prominent benchmark is HPL (High
Performance Linpack), which solves systems of linear
equations mostly by doing lots of matrix multiplies. This is a
particularly "supercomputer friendly" benchmark because
performance of communication between PEs becomes less important
as the problem is scaled up, and the benchmark allows scaling
the problem as big as can fit in the machine rather than timing
the same-size problem on all machines. The results are reported
as the FLOPS obtained in running the benchmark. The Top 500
list, which ranks the world's supercomputers by this metric, is
one everyone watches closely... which has been good for UK (the
University of Kentucky) in that machines operated by CCS have
historically placed well on it, something true of fewer than ten
US universities. That said, UK is not on the most recent list:
the 288-node, 4,736-core Lipscomb peaks at around 140 TFLOPS,
while the 500th-ranked machine on the June 2014 list hits 145.6
TFLOPS. The supercomputing facilities in
my lab total around 1/6 of the performance of Lipscomb, but
that is spread across clusters with a total of about 500 nodes
that have various "interesting" configurations, including
innovative network designs, Nvidia GPUs, AMD GPUs, and even
FPGAs. Adding my lab to Lipscomb could have ranked us as high as
around 400th, but they really are separate facilities.
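The FLOPS figure HPL reports follows a fixed accounting rule: solving an n-by-n dense system by LU factorization is conventionally charged (2/3)n^3 + 2n^2 floating-point operations, divided by wall-clock time. A sketch (the problem size and run time below are made-up numbers):

```python
def hpl_flops(n, seconds):
    """FLOPS rate HPL would report for solving an n-by-n dense
    linear system in the given wall-clock time, using HPL's
    conventional operation count of (2/3)n^3 + 2n^2."""
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return ops / seconds

# Hypothetical run: a 100,000-unknown system solved in one hour.
rate = hpl_flops(100_000, 3600.0)
print(f"{rate / 1e9:.1f} GFLOPS")  # about 185 GFLOPS
```

Because the operation count grows as n^3 while memory use grows only as n^2, scaling the problem up raises arithmetic intensity relative to communication, which is part of why letting each machine pick its own problem size makes HPL so supercomputer friendly.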
-
It is worth noting that the machines on the Top 500 list have recently
made a turn toward the huge -- Tianhe-2 has 3,120,000
cores! Although many-core chips (mostly GPUs) are now
common in these machines, we're still on a plateau where scaling
up is largely a matter of money rather than new technology, and
it seems there's always more money for matters of national pride
(i.e., countries fighting for positions at the top of the list).
Additionally, when the price/performance improvements due to new
technologies slow, budgets for big machines in general seem to
go up to compensate. Machine cost is not listed on the Top 500,
but has
definitely gone up sharply over the last decade. The current
number one, the Chinese Tianhe-2, tops the 43rd list with 33.86
PFLOPS on HPL (54.90 PFLOPS peak). That said, the US still
dominates the 500 with 233 machines on the list and 6 of the top
10, although China, the UK, France, Germany, and Japan remain
major players.
-
A nice reference for standard benchmarks is SPEC, the Standard Performance
Evaluation Corporation. The text has always been fond of SPEC,
but it's good to understand that there are many benchmark
suites out there, and how much they really matter to you
depends on how closely your application(s) resemble them. For
example, the HPL benchmark makes very intense use of
double-precision floating-point multiply-adds, but doesn't even
count integer operations. Many applications are dominated by the
performance of integer, or even character, processing.
-
Here's another interesting tidbit: the US government
traditionally uses CTP (Composite Theoretical Performance) in
MTOPS (Millions of Theoretical Operations Per Second) as the
primary performance measure for export control purposes. The
formula for computing MTOPS is pretty convoluted and probably
not any more accurate than quoting "Peak MFLOPS." In any case,
the export controls don't really work anymore because
supercomputers are built using mostly commodity parts -- many of
which no longer come from the US (even if the companies that
designed them are based here). President Obama set the export limit at 3.0 TFLOPS on March 16, 2012.
Textbook: Computer Organization and Design.