Problems Problems... CUDA initialization issues

Submitted by Xilodyne on Sat, 10/20/2018 - 00:36

At some point after using my Keras-configured TensorFlow, my GPU server becomes unusable, with a load over 10 caused by an IRQ/133-nvidia process. The process cannot be killed, and I end up having to reboot.

top shows high nvidia irq usage

Googling the problem shows that it is a common one (link). My server (Ubuntu 16.04 LTS) is "headless": I run it without a GUI and use only the command line. On such a setup one can apparently have problems with stopping and starting CUDA. The solution is to use the NVIDIA Persistence Daemon (link).

NVIDIA Persistence Mode

>sudo nvidia-smi -h | grep pers
Python wrappers to NVML are also available.  The output of NVSMI is
    -pm,  --persistence-mode=   Set persistence mode: 0/DISABLED, 1/ENABLED
    replay                      Used to replay/extract the persistent stats generated by daemon.

>sudo nvidia-smi -pm 0
Disabled persistence mode for GPU 00000000:01:00.0.
Disabled persistence mode for GPU 00000000:04:00.0.
All done.
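
To see which GPUs currently have persistence mode switched off, the state can be queried per GPU rather than read from the help text. The following is a minimal sketch, not a tested recipe for this server: the function name `gpus_without_persistence` is made up for illustration, and it assumes `nvidia-smi` prints lines of the form `0, Enabled` for the `index,persistence_mode` query.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not part of nvidia-smi):
# reads "index, persistence_mode" CSV lines on stdin and prints the
# index of every GPU whose persistence mode is reported as Disabled.
gpus_without_persistence() {
    awk -F', *' '$2 == "Disabled" { print $1 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
        | gpus_without_persistence
fi
```

Any index printed here is a GPU that would still need `sudo nvidia-smi -pm 1`, or the persistence daemon, before it keeps its driver state between CUDA runs.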