The goal of this first NVIDIA CUDA project is to develop an understanding of the coherence model employed by NVIDIA's hardware in accessing global memory. This is important because NVIDIA's global memory really can be used for communication within a kernel, but only if you choose the virtualization parameters wisely.
Since this is your first NVIDIA CUDA project, you'll need to install some stuff... you get the stuff from NVIDA. If you are installing on your own machine, you'll need to install the NVIDIA CUDA 4.0 drivers and, most likely, GCC 4.4. The latest CUDA support conflicts with some GCC 4.5 stuff that most modern Linux systems use, so you'll likely want both versions of GCC installed. The machines in Marksbury already have that installed. Basically, the last step is to install SDK for CUDA 4.0. This will make a directory hierarchy called NVIDIA_GPU_Computing_SDK in your home directory, and each student has their own, so feel free to play with modifying any of the applications there.
In fact, I strongly recommend that you build the applications in NVIDIA_GPU_Computing_SDK/C and then run at least a few of them. The executables are placed in NVIDIA_GPU_Computing_SDK/C/bin/linux/release. Some of the code supplied by NVIDIA will hang the system, so don't be shocked when that happens... just reboot the machine. The best examples to start examining source code for are deviceQuery and vectorAdd. They contain the basic ideas for everything you'll need to do in this first project.
Memory coherence on NVIDIA GPUs is largely an artifact of the hardware scheduling policy. Thus, the key influence on conherence is not what addresses one is touching, but how many warps and blocks are being managed. Recall that a "warp" is 32 PEs ("threads" in NVIDIA speak), and even though the hardware really consists of some number of 8-wide SIMD engines (aka, "multiprocessors"), a group of 32 PEs always is scheduled together as a 4-cycle unit. Further, local memory (aka, "shared memory") is allocated to be shared only within a block, which is essentially some number of warps grouped together.
Fundamentally, each block executes only on a single SIMD engine, and timesharing or multithreading happens only within or among the blocks that reside on a particular SIMD engine. You can request blocks of any size, and larger is usually better because it provides more opportunity to hide global memory latency. However, too big a block will fail to be scheduled. Similarly, if you request too many blocks, they will be broken into timeshared batches such that, within a SIMD engine, one batch must run to completion before the next batch is allowed to begin.
You are going to write and run a program that tests a wide range of block sizes and numbers of blocks to determine under what circumstances you can be sure that your entire program will be executed in a single batch... thus providing memory coherence model that allows communication through global memory writes and reads.
Your program is going to test a range of values for the block size and the number of blocks to see if memory behaves coherently. The relevant kernel code is given to you:
/* Random formula from numerical recipes */
#define HASH(p) ( \
((unsigned)(((p) * 1664525) + 1013904223)) % \
((unsigned)NPROC))
#define TRIES 1024
/* Device kernel program */
__global__ void GetIt(volatile int *ref, volatile int *got, int NPROC)
{
int IPROC = blockDim.x * blockIdx.x + threadIdx.x;
int h = HASH(IPROC);
int histno = -1;
++(ref[IPROC]);
__threadfence();
do {
got[IPROC] = ++histno;
} while ((histno < TRIES) && ((ref[h] != (h+1)));
}
This kernel simply has every virtual PE increment its ref value and then poll the ref value of PE h to see when it becomes equal to h+1. While polling, it keeps track of the number of times it has polled. This count is also used to abort the polling loop if things look like they will never change. Notice the volatile declarations and __threadfence(); both of these are used to ensure memory writes are neither buffered nor rearranged by compile-time instruction scheduling.
What your code must do is:
What you should observe is that there is a range of parameters for which coherent behavior is observed. For larger sizes, there will be some fraction of the communications that fail... but not all of them. You should determine and report the answers to two questions:
Be sure to indicate on which machine you measured this -- which GPU.
You will be submitting source code, a make file, and a simple HTML-formatted "implementor's notes" document which also should summarize your findings.
For full consideration, your project should be submitted no later than November 23, 2011. Submit your tar file here:
GPU Computing