One of the big issues in use of GPUs, or nearly any cheap commodity hardware, for scientific and engineering problems is the heavy reliance on double precision floating-point arithmetic. Of course, any "Turing capable" machine can implement 64-bit IEEE-compliant double-precision arithmetic, but commodity applications don't need this, so hardware is not optimized for these operations. For example, even GPUs that implement 64-bit double arithmetic in hardware still have memory banking and other features tuned for handling 32-bit objects. How can we provide accuracy like double precision in a way that is efficient even on GPUs that have no hardware support for doubles?
As we discussed in class, perhaps the best answer is to use pairs of native float values instead. In native pair arithmetic, a native (single) float result value is computed and a second native value is then set to hold the error in the first value. We'll often refer to these values as the "hi" and "lo" parts of a native-pair value. It takes a fair number of additional operations to compute and maintain the "lo" error terms, but the result is arithmetic of a quality comparable to that obtained using doubles. Further, the extra operations allow the memory layout to be banked better, and the extra arithmetic simply makes it easier to hide memory access delays, so performance may actually be comparable to or even better than using doubles. Let's find out!
To find out, you are going to modify the standard CUDA library matrixMul demo to use 32-bit native-pair arithmetic.
It is perhaps surprising that there are various different algorithms one can use for most native-pair arithmetic operations. This is especially true of multiply, due to the interesting property that simply multiplying the two "hi" parts generates enough bits to make the "lo" parts unimportant enough to use a much poorer approximation in processing them (compare nativepair_mul to good_nativepair_mul; there is quite a bit more code to get a usually negligible improvement in accuracy). The handout we discussed in class describes the algorithms in detail; here they are as C code:
#define NATIVEBITS 24
#define NATIVESPLIT ((1 << (NATIVEBITS - (NATIVEBITS / 2))) + 1.0)
typedef volatile float native;
typedef struct {
native hi, lo;
} nativepair;
nativepair nativepair_zero = { 0 , 0 };
#ifdef USE_INT_MASKING
#define NATIVETOPHALFMASK ~((1 << (NATIVEBITS / 2)) - 1)
typedef volatile int nativeint;
nativepair
double2nativepair(double d)
{
nativepair n;
n.hi = d;
n.lo = (d - n.hi);
return(n);
}
double
nativepair2double(nativepair n)
{
return(((double) n.hi) + ((double) n.lo));
}
native
native_and(native a,
nativeint mask)
{
/* Bit mask a float as though the bits were nativeint */
*((nativeint *) &a) = (mask & *((nativeint *) &a));
return(a);
}
#endif
nativepair
nativepair_normalize(native hi,
native lo)
{
/* Return hi,lo as a normalized nativepair */
nativepair r;
native hierr;
r.hi = hi + lo;
hierr = hi - r.hi;
r.lo = hierr + lo;
return(r);
}
nativepair
nativepair_add(nativepair a,
nativepair b)
{
/* Return a+b */
native hi = a.hi + b.hi;
native lo = a.lo + b.lo;
native bhi = hi - a.hi;
native ahi = hi - bhi;
native bhierr = b.hi - bhi;
native ahierr = a.hi - ahi;
native hierr = bhierr + ahierr;
lo += hierr;
return(nativepair_normalize(hi, lo));
}
nativepair
native_mul(native a,
native b)
{
/* Return the native product a*b as a nativepair.
This is complicated by the fact that multiplication in
general doubles the number of mantissa bits, yet these
extra bits are not available using a native multiply.
In a correctly implemented fused multiply-add, these
extra bits are not discarded (normalized away) until
after the addition, so the code is considerably simpler.
*/
nativepair c;
#ifdef HAS_FUSED_MULADD
/* Actually written for fused multiply-subtract... */
c.hi = a * b;
c.lo = a * b - c.hi;
#else
#ifdef USE_INT_MASKING
native atop = native_and(a, NATIVETOPHALFMASK);
native btop = native_and(b, NATIVETOPHALFMASK);
#else
native asplit = a * NATIVESPLIT;
native bsplit = b * NATIVESPLIT;
native as = a - asplit;
native bs = b - bsplit;
native atop = as + asplit;
native btop = bs + bsplit;
#endif
native abot = a - atop;
native bbot = b - btop;
native top = atop * btop;
native mida = atop * bbot;
native midb = btop * abot;
native mid = mida + midb;
native bot = abot * bbot;
c = nativepair_normalize(top, mid);
c.lo += bot;
#endif
return(c);
}
nativepair
nativepair_mul(nativepair a,
nativepair b)
{
/* Return a*b */
nativepair tops = native_mul(a.hi, b.hi);
native hiloa = a.hi * b.lo;
native hilob = b.hi * a.lo;
native hilo = hiloa + hilob;
tops.lo += hilo;
return(nativepair_normalize(tops.hi, tops.lo));
}
nativepair
good_nativepair_mul(nativepair a,
nativepair b)
{
/* Return a*b */
nativepair tops = native_mul(a.hi, b.hi);
nativepair mid1 = native_mul(a.hi, b.lo);
nativepair mid2 = native_mul(a.lo, b.hi);
nativepair low = native_mul(a.lo, b.lo);
nativepair t;
tops = nativepair_normalize(tops.hi, tops.lo);
mid1 = nativepair_normalize(mid1.hi, mid1.lo);
mid2 = nativepair_normalize(mid2.hi, mid2.lo);
low = nativepair_normalize(low.hi, low.lo);
t = nativepair_add(mid1, low);
t = nativepair_add(mid2, t);
t = nativepair_add(tops, t);
return(t);
}
Begin this project by making a directory:
~/NVIDIA_GPU_Computing_SDK/C/src/npair
You will be using the SDK's make files, and keeping everything named this way will simply add a sample application called npair. Correspondingly, I've made a simplified copy of the code from the matrixMul example with things appropriately named and references to CUBLAS removed. CUBLAS doesn't support doubles on GPUs that don't have the hardware, which is why I've removed it. Grab a copy of NPAIR.tgz and untar it inside the directory you've just created.
Once you've done that, simply typing make should build a straight single-precision version of npair, essentially equivalent to matrixMul, but without the CUBLAS.
You'll notice, however, that the source code no longer talks about float values, but about NUMBER values. Simply changing the definition of that in npair.h to double will build the double-precision version... if your GPU supports double. There may be issues with the assumptions about how much fits in shared memory, so you also may need to reduce the shared memory array sizes by a factor of 2.
Your task is simply to modify this code so that it uses native pair arithmetic inside the GPU, but doubles in the host. You are free to use "creative" data layout of the high and low portions of each native-pair value, you can keep them in consecutive memory addresses or skew them as we discussed in class.
As we discussed, hardware support for doubles is expensive, and it isn't clearly going to give superior results in accuracy or speed as compared to a native pair implementation. In fact, native pair arithmetic has an unfair disadvantage for this particular application because the GPU has multiply-add instructions that give better than normal performance for this algorithm using ordinary single-precision float arithmetic. The question is, how much slower is native pair than ordinary single?
Be sure to indicate on which machine you measured the execution speeds for both single and native-pair versions.
You will be submitting source code, a make file, and a simple HTML-formatted "implementor's notes" document which also should summarize your findings. Basically, I expect a tar of your native-pair version of the npair directory's contents, although you can omit the obj subdirectory and Sdk... files.
For full consideration, your project should be submitted no later than December 1, 2011. Submit your tar file here:
GPU Computing