Sudden unexplained Tesla GPU failures: a corrupt JIT cache

Having just spent several hours debugging an issue, I thought I should share it with the internet. It seems fairly obscure, and I didn’t find any hints online about what might be causing it.

The Symptoms

  • CUDA programs fail when executed, giving either malloc errors or ‘unknown error’!
  • Nothing important was changed (no new drivers, no software updates, no new CUDA toolkit).
  • Previously working binaries now fail!
  • GPU resets and full system reboots don’t fix the problem.
  • Whole families of GPUs fail across a cluster (e.g. all P100s fail on all nodes, but the K80s, even in the same nodes, still work).

The final symptom was the key to solving the problem:

  • Everything works fine for other users!
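To pin down which GPUs fail and with which error, a small probe along the following lines may help. This is only a sketch, not the program that originally failed: the kernel `touch`, the helper `probe`, and the file name `probe.cu` are made up for illustration, and it assumes the CUDA toolkit and `nvcc` are installed. It tries an allocation and a trivial kernel launch on every visible device and prints the runtime's error string for the first step that fails.

```cuda
// probe.cu (hypothetical name): per-device sanity check.
// Illustrative sketch only; not the code from the original failure.
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel so that the device code in this binary is actually loaded.
__global__ void touch(int* p) { *p = 1; }

// Runs the checks on one device. Returns the first error encountered and
// stores the name of the step that produced it in *stage.
static cudaError_t probe(int dev, const char** stage) {
    int* ptr = nullptr;

    *stage = "cudaSetDevice";
    cudaError_t err = cudaSetDevice(dev);
    if (err != cudaSuccess) return err;

    *stage = "cudaMalloc";
    err = cudaMalloc(reinterpret_cast<void**>(&ptr), sizeof(int));
    if (err != cudaSuccess) return err;

    *stage = "kernel launch";
    touch<<<1, 1>>>(ptr);
    err = cudaGetLastError();
    if (err == cudaSuccess) {
        *stage = "cudaDeviceSynchronize";
        err = cudaDeviceSynchronize();
    }

    cudaFree(ptr);
    return err;
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int failures = 0;
    for (int dev = 0; dev < count; ++dev) {
        // Look up the device name so failures can be grouped by GPU family.
        cudaDeviceProp prop;
        const char* name = "(unknown)";
        if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess) name = prop.name;

        const char* stage = "";
        err = probe(dev, &stage);
        if (err == cudaSuccess) {
            std::printf("device %d (%s): ok\n", dev, name);
        } else {
            std::printf("device %d (%s): %s failed: %s\n",
                        dev, name, stage, cudaGetErrorString(err));
            ++failures;
        }
    }
    return failures ? 1 : 0;
}
```

Built with something like `nvcc probe.cu -o probe` and run on the same node first under the affected account and then under another user's account, it gives a quick per-device, per-user comparison. Whether it reproduces the failure at all may depend on which GPU architectures the binary was compiled for, so a clean run here does not rule the problem out.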

