Assignment 0: Thanks For The Memories

The goal of this first NVIDIA CUDA project is to develop an understanding of the coherence model employed by NVIDIA's hardware in accessing global memory. This is important because NVIDIA's global memory really can be used for communication within a kernel, but only if you choose the virtualization parameters wisely.

To Begin

Since this is your first NVIDIA CUDA project, you'll need to install some stuff... you get the stuff from NVIDA. If you are installing on your own machine, you'll need to install the NVIDIA CUDA 4.0 drivers and, most likely, GCC 4.4. The latest CUDA support conflicts with some GCC 4.5 stuff that most modern Linux systems use, so you'll likely want both versions of GCC installed. The machines in Marksbury already have that installed. Basically, the last step is to install SDK for CUDA 4.0. This will make a directory hierarchy called NVIDIA_GPU_Computing_SDK in your home directory, and each student has their own, so feel free to play with modifying any of the applications there.

In fact, I strongly recommend that you build the applications in NVIDIA_GPU_Computing_SDK/C and then run at least a few of them. The executables are placed in NVIDIA_GPU_Computing_SDK/C/bin/linux/release. Some of the code supplied by NVIDIA will hang the system, so don't be shocked when that happens... just reboot the machine. The best examples to start examining source code for are deviceQuery and vectorAdd. They contain the basic ideas for everything you'll need to do in this first project.

The Point Of The Project

Memory coherence on NVIDIA GPUs is largely an artifact of the hardware scheduling policy. Thus, the key influence on conherence is not what addresses one is touching, but how many warps and blocks are being managed. Recall that a "warp" is 32 PEs ("threads" in NVIDIA speak), and even though the hardware really consists of some number of 8-wide SIMD engines (aka, "multiprocessors"), a group of 32 PEs always is scheduled together as a 4-cycle unit. Further, local memory (aka, "shared memory") is allocated to be shared only within a block, which is essentially some number of warps grouped together.

Fundamentally, each block executes only on a single SIMD engine, and timesharing or multithreading happens only within or among the blocks that reside on a particular SIMD engine. You can request blocks of any size, and larger is usually better because it provides more opportunity to hide global memory latency. However, too big a block will fail to be scheduled. Similarly, if you request too many blocks, they will be broken into timeshared batches such that, within a SIMD engine, one batch must run to completion before the next batch is allowed to begin.

You are going to write and run a program that tests a wide range of block sizes and numbers of blocks to determine under what circumstances you can be sure that your entire program will be executed in a single batch... thus providing memory coherence model that allows communication through global memory writes and reads.

Functional Requirements

Your program is going to test a range of values for the block size and the number of blocks to see if memory behaves coherently. The relevant kernel code is given to you:

/* Random formula from numerical recipes */
#define HASH(p) ( \
	((unsigned)(((p) * 1664525) + 1013904223)) % \
	((unsigned)NPROC))
#define	TRIES 1024

/* Device kernel program */
__global__ void GetIt(volatile int *ref, volatile int *got, int NPROC)
{
    int IPROC = blockDim.x * blockIdx.x + threadIdx.x;
    int h = HASH(IPROC);
    int histno = -1;

    ++(ref[IPROC]);
    __threadfence();
    do {
        got[IPROC] = ++histno;
    } while ((histno < TRIES) && ((ref[h] != (h+1)));
}

This kernel simply has every virtual PE increment its ref value and then poll the ref value of PE h to see when it becomes equal to h+1. While polling, it keeps track of the number of times it has polled. This count is also used to abort the polling loop if things look like they will never change. Notice the volatile declarations and __threadfence(); both of these are used to ensure memory writes are neither buffered nor rearranged by compile-time instruction scheduling.

What your code must do is:

Allocate the ref and got arrays on the GPU.
The ref array should be initialized to ref[i]=i and all entries of got should be initialized to a large number, e.g., TRIES+1
Invoke GetIt<<<block_count, block_size>>>(d_ref, d_got, nproc); and collect got back on the host
On the host, scan got to answer the questions listed below.
Pick another block_count and block_size and repeat from step 2. You should try all possible combinations up to 4096 times the number of multiprocessors in the system, but remeber that block_size should always be a multiple of 32 (the warp size).

What You Should Find

What you should observe is that there is a range of parameters for which coherent behavior is observed. For larger sizes, there will be some fraction of the communications that fail... but not all of them. You should determine and report the answers to two questions:

What is the maximum total number of virtual PEs you can run with no coherence failures, i.e., without any thread failing to get the correct value communicated? Also give the minimum block_count and maximum block_size used to achieve this value.
What is the maximum total number of virtual PEs you can run with no coherence failures and no additional iterations needed to get the correct value communicated? Also give the minimum block_count and maximum block_size used to achieve this value.

Be sure to indicate on which machine you measured this -- which GPU.

Due Dates, Submission Procedure, & Such

You will be submitting source code, a make file, and a simple HTML-formatted "implementor's notes" document which also should summarize your findings.

For full consideration, your project should be submitted no later than November 23, 2011. Submit your tar file here:

GPU Computing