NAK Status

The monitoring is currently implemented very crudely, but the slowly updated image above shows the latest known status of NAK's GPUs. NVIDIA does not provide an obvious way to monitor the general utilization of a GPU, but temperature sensing is easy using nvidia-smi -a -q, which is what the monitoring script does, with a 300 s gap between updates.
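As a rough illustration of this kind of polling, here is a minimal Python sketch; it is not the actual NAK script. It assumes that plain `nvidia-smi -q` is enough to get the report, and that the temperature appears on a line shaped like `GPU Current Temp : 45 C` (or `Gpu : 45 C` on older drivers). The report layout varies between driver versions, so the pattern may need adjusting. The names `parse_temps`, `read_temps`, and `poll_forever` are hypothetical.

```python
import re
import subprocess
import time

POLL_SECONDS = 300  # gap between updates, as stated in the text

# ASSUMPTION: the report has a line like "GPU Current Temp : 45 C"
# (newer drivers) or "Gpu : 45 C" (older drivers). Adjust as needed.
TEMP_LINE = re.compile(
    r"^\s*GPU(?: Current Temp)?\s*:\s*(\d+)\s*C\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def parse_temps(report):
    """Return every GPU temperature (degrees C) found in a report."""
    return [int(value) for value in TEMP_LINE.findall(report)]

def read_temps():
    """Run nvidia-smi on this node and parse its report."""
    result = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    )
    return parse_temps(result.stdout)

def poll_forever(report_fn=print):
    """Report temperatures every POLL_SECONDS; report None on failure."""
    while True:
        try:
            report_fn(read_temps())
        except (OSError, subprocess.CalledProcessError):
            report_fn(None)  # no new data for this node
        time.sleep(POLL_SECONDS)
```

Keeping the parsing separate from the `nvidia-smi` call means the pattern can be checked against a saved report on a machine that has no GPU.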

In the image, node color represents temperature. The displayed range goes from 35 C (blue) to 75 C (red). Nodes for which new data has not been reported are shown in magenta. There are various reasons why that can happen, although a system crash is certainly one of the more obvious.
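The color coding could be sketched as follows. The 35 C and 75 C endpoints and the magenta marker come from the description above; the linear RGB blend between blue and red, the clamping of out-of-range readings, and the name `node_color` are assumptions made for illustration, not necessarily what the real image generator does.

```python
T_MIN, T_MAX = 35.0, 75.0   # displayed range in degrees C (from the text)
MAGENTA = (255, 0, 255)     # marker for nodes with no new data

def node_color(temp_c):
    """Map a temperature to an (R, G, B) tuple; None means no data."""
    if temp_c is None:
        return MAGENTA
    # ASSUMPTION: linear blend, with readings outside the range clamped.
    frac = (temp_c - T_MIN) / (T_MAX - T_MIN)
    frac = max(0.0, min(1.0, frac))
    red = round(255 * frac)
    return (red, 0, 255 - red)
```

With this mapping, `node_color(35)` is pure blue, `node_color(75)` is pure red, and `node_color(None)` is magenta.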

Note that we have found quite a wide variation in node GPU temperature under identical circumstances. Differences of as much as 22 C seem to occur quite easily. Some of the difference can be explained by airflow patterns, but differences in manufacturing and scheduling must account for much of this surprisingly large temperature range. Of course, the whole machine gets warmer when the air conditioning cycles off or the GPU computations become more intense.


The Aggregate. The only thing set in stone is our name.