Problems Problems... CUDA initialization issues

Submitted by Xilodyne on Sat, 10/20/2018 - 00:36

At some point after using my Keras-configured TensorFlow, my GPU server becomes unusable, with a load over 10 caused by an IRQ/133-nvidia process. It isn't possible to KILL the process, and I end up having to reboot.

top shows high nvidia irq usage

Googling shows that this is a common problem (link). My server (Ubuntu 16.04 LTS) is "headless", that is, I run it without a GUI and only use the command line, and in that setup one can apparently have problems with stopping and starting CUDA. The solution is to use the NVIDIA Persistence Daemon (link).

NVIDIA Persistence Mode
`>sudo nvidia-smi -h | grep pers`
`Python wrappers to NVML are also available. The output of NVSMI is`
`-pm, --persistence-mode= Set persistence mode: 0/DISABLED, 1/ENABLED`
`replay Used to replay/extract the persistent stats generated by daemon.`

`>sudo nvidia-smi -pm 0`
`Disabled persistence mode for GPU 00000000:01:00.0.`
`Disabled persistence mode for GPU 00000000:04:00.0.`
`All done.`
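To confirm which mode each GPU is actually in after a change like the one above, the state can be queried per GPU. The following is only a minimal sketch, not part of the original setup: it assumes the installed `nvidia-smi` supports the `persistence_mode` field of `--query-gpu` and prints rows shaped like `0, Enabled`, and `gpus_without_persistence` is a helper name made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: list the GPUs whose persistence mode is not enabled.

# Hypothetical helper. Reads "index, state" CSV rows on stdin (the shape
# assumed for: nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader)
# and prints the index of every GPU whose state is not "Enabled".
gpus_without_persistence() {
    awk -F', *' 'NF >= 2 && $2 != "Enabled" { print $1 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
        | gpus_without_persistence
fi
```

If the script prints nothing, every GPU reported persistence mode as enabled. According to the help text shown above, `sudo nvidia-smi -pm 1` is the setting that enables it.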
