Sudden unexplained Tesla GPU failures: a corrupt JIT cache

Having just spent several hours debugging an issue, I thought I should share it with the internet. It seems fairly obscure, and I didn’t find any hints online about what might be causing it.

The Symptoms

  • Executing CUDA programs fails, giving either malloc errors or ‘unknown error’! (A minimal check is sketched below.)
  • Nothing important was changed (no new drivers, no software updates, no new CUDA toolkit).
  • Previously working binaries now fail!
  • GPU resets and full system reboots don’t fix the problem.
  • Whole families of GPUs fail across the cluster (e.g. all the P100s fail on every node, while the K80s in those same nodes still work).

The final symptom was the key to solving the problem:

  • Everything works fine for other users!
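
For anyone hitting the same thing, a quick way to see whether a given account can use the GPU at all is a tiny program like the one below. This is my own diagnostic sketch (the file name is illustrative, not from the original debugging session); on an affected account the failures described above surface as the printed error string.

 // cuda_check.cu: tiny check of whether this account can use the GPU at all.
 // Build with: nvcc cuda_check.cu -o cuda_check
 #include <cstdio>
 #include <cuda_runtime.h>

 int main() {
     void *buf = nullptr;
     // The first CUDA runtime call also initializes the context for this process.
     cudaError_t err = cudaMalloc(&buf, 1 << 20);
     if (err != cudaSuccess) {
         // On a broken account this is where the malloc / ‘unknown error’ messages appear.
         printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
         return 1;
     }
     printf("cudaMalloc OK\n");
     cudaFree(buf);
     return 0;
 }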

The Cause

After a lot of intensive problem solving I found that the CUDA just-in-time (JIT) compiler cache was corrupt. Simply deleting the cache fixed the problem.

JIT compilation allows CUDA to create optimized code for the exact GPU it is about to run on. This means that at build time we don’t need to compile a separate set of kernels for every generation of GPU the program might ever run on. What I did not know is that the CUDA runtime caches these JIT-compiled binaries. On Linux the cache lives in ~/.nv/, which is why the corruption only affected one user.
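
To make the mechanism concrete, here is a sketch of a kernel built so that only PTX is embedded (the file name and nvcc flags are illustrative, not how our binaries were actually built). The first launch on a given GPU goes through the JIT compiler, and the compiled result is what ends up in the cache; later runs reuse it.

 // jit_demo.cu: built with only PTX embedded, so the runtime must JIT-compile it, e.g.
 //   nvcc -gencode arch=compute_60,code=compute_60 jit_demo.cu -o jit_demo
 // The first launch compiles the PTX for the actual GPU and stores the result in the
 // JIT cache (under ~/.nv/ by default on Linux); subsequent runs load it from there.
 #include <cstdio>
 #include <cuda_runtime.h>

 __global__ void scale(float *x, float a, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) x[i] *= a;
 }

 int main() {
     const int n = 1 << 20;
     float *d = nullptr;
     if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
     scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // this first launch triggers the JIT step
     cudaError_t err = cudaDeviceSynchronize();
     printf("kernel: %s\n", cudaGetErrorString(err));
     cudaFree(d);
     return 0;
 }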

It’s not clear how the cache became corrupted, but one possibility is that it happened because our /home folders are network mounted. My guess is that multiple processes on multiple nodes were writing to the same cache location simultaneously and something went wrong. That would also fit the symptom above: the cached binaries are per GPU architecture, so a corrupted entry would break every P100 in the cluster for that user while leaving the K80s untouched.

The Solution

It seems there are a couple of reasonable fixes to avoid this in the future (a small example of applying them is sketched after the list):

1. Turn off JIT caching:

 CUDA_CACHE_DISABLE=1

2. Make the cache local rather than on a network drive:

 CUDA_CACHE_PATH=/some/local/path
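
Either variable can simply be exported in a shell profile or job script before launching CUDA programs. As a per-process alternative, the sketch below sets it from inside the program itself; the /tmp path is illustrative, and this relies on my understanding that the driver only reads these variables when CUDA is first initialized in the process, which I have not verified against the documentation.

 // cache_local.cu: per-process variant of fix 2 (my own sketch).
 #include <cstdio>
 #include <cstdlib>
 #include <cuda_runtime.h>

 int main() {
     // Point the JIT cache at a node-local directory before any CUDA call is made.
     setenv("CUDA_CACHE_PATH", "/tmp/cuda-cache", 1);
     // setenv("CUDA_CACHE_DISABLE", "1", 1);   // or fix 1: disable the cache entirely

     cudaError_t err = cudaFree(0);   // harmless call that forces CUDA initialization
     printf("CUDA init: %s\n", cudaGetErrorString(err));
     return 0;
 }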

More information on JIT caching can be found in NVIDIA’s blog post ‘CUDA Pro Tip: Understand Fat Binaries and JIT Caching’.