At some point after using my Keras-configured TensorFlow, my GPU server becomes unusable, with a load above 10 caused by an IRQ/133-nvidia process. The process cannot be killed, and I end up having to reboot.
Googling shows that this is a common problem (link). My server (Ubuntu 16.04 LTS) is "headless", that is, I run it without a GUI and only use the command line. In that setup one can apparently run into problems when CUDA stops and starts, presumably because nothing keeps the NVIDIA driver loaded between jobs. The solution is to use the NVIDIA Persistence Daemon (link).
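To check whether the fix has taken effect, one option is to ask `nvidia-smi` for each GPU's persistence mode. The following is only a minimal sketch, not something from the linked pages: it assumes `nvidia-smi` is on the `PATH` and that the `persistence_mode` query field reports `Enabled` or `Disabled`, one line per GPU. The function names are my own.

```python
import shutil
import subprocess


def parse_persistence_modes(csv_output):
    """Turn nvidia-smi output (one line per GPU) into a list of booleans."""
    return [
        line.strip().lower() == "enabled"
        for line in csv_output.splitlines()
        if line.strip()  # skip blank lines
    ]


def query_persistence_modes():
    """Return the persistence mode of every GPU, or None if it cannot be read."""
    # Assumption: nvidia-smi is installed and reachable on the PATH.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return None
    return parse_persistence_modes(result.stdout)


if __name__ == "__main__":
    modes = query_persistence_modes()
    if modes is None:
        print("nvidia-smi is not available or failed; cannot check persistence mode")
    elif modes and all(modes):
        print("persistence mode is enabled on all GPUs")
    else:
        print("persistence mode is disabled on at least one GPU (or no GPU was found)")
```

If this reports that persistence mode is disabled, the daemon is probably not running. How it gets started (directly as `nvidia-persistenced`, or through a service unit) depends on how the driver was packaged, so check your own installation.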