NVIDIA Driver Woes

The solution to a recent problem I had with the NVIDIA driver and CUDA
Published

June 7, 2021

I recently had an unplanned restart of my deep learning machine. When it came back up, CUDA no longer worked and nvidia-smi reported: “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.” Sure enough, the /dev/nvidia* device files were missing.

I normally run the machine headless, since I can just ssh into it and do everything remotely. Some of the advice online suggested turning off secure boot, so I had to hook the machine up to a monitor and keyboard to check. It turned out that secure boot was already off, and that the screen became corrupted as soon as the nvidia driver loaded. The machine doesn’t even have a graphical user interface, so what got corrupted was the text console itself.

After a lot of uninstalling, downgrading, and testing, I found that merely installing and loading the nvidia driver was enough to trigger the corruption. So the problem had nothing to do with CUDA or apex (which I had separately suspected).

Instead, the kernel had been updated and the driver had problems with the new version. What I needed to do was install the Linux headers for the new kernel:

sudo apt-get install linux-headers-$(uname -r)

If you run into a similar problem after a kernel update, this might be worth a try.
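If you want to check whether you are in the same situation before installing anything, here is a rough sketch of the two checks I would do. It assumes a Debian or Ubuntu style system where header packages unpack to /usr/src/linux-headers-<version>; the function names are made up for illustration and are not part of any NVIDIA tooling.

```shell
#!/bin/sh
# Diagnostic sketch: are the headers for the running kernel installed,
# and do the /dev/nvidia* device nodes exist?

# Assumption: Debian/Ubuntu header packages unpack to this directory.
headers_dir_for() {
    printf '/usr/src/linux-headers-%s\n' "$1"
}

# Succeeds if at least one /dev/nvidia* device node exists.
nvidia_devices_present() {
    for node in /dev/nvidia*; do
        # If nothing matches, the pattern stays literal and -e fails.
        [ -e "$node" ] && return 0
    done
    return 1
}

kernel="$(uname -r)"

if [ -d "$(headers_dir_for "$kernel")" ]; then
    echo "headers for $kernel: installed"
else
    echo "headers for $kernel: missing (try: sudo apt-get install linux-headers-$kernel)"
fi

if nvidia_devices_present; then
    echo "/dev/nvidia* device nodes: present"
else
    echo "/dev/nvidia* device nodes: missing"
fi
```

If the headers show up as missing for the kernel you are currently running, that matches what I saw. Missing device nodes on their own only tell you the driver did not load, not why.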