Sometimes my CUDA setup stops working. This post is a reminder for future me on how to get up and running again quickly.
Kernel Headers Missing
After I update the kernel on my machine, nvidia-smi reports:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
This is likely because the headers for the new kernel have not been installed. To install them, run:
sudo apt-get install linux-headers-$(uname -r)
You do not need to restart after running this.
It would be nice if there were a way to get these through a meta package.
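Before reinstalling anything, it can help to confirm that the headers really are missing for the running kernel. The sketch below is one way to check. It assumes the headers package provides the /lib/modules/<release>/build entry, which is how Ubuntu lays it out as far as I know. The function name headers_present and its second argument are made up for this example.

```shell
# Hypothetical helper: succeeds if headers for the given kernel release look installed.
# Assumption: the headers package provides <modules_root>/<release>/build.
headers_present() {
    release="$1"
    modules_root="${2:-/lib/modules}"   # overridable so the check can be tried on a scratch directory
    [ -e "$modules_root/$release/build" ]
}

if headers_present "$(uname -r)"; then
    echo "headers look present for $(uname -r)"
else
    echo "headers look missing; try: sudo apt-get install linux-headers-$(uname -r)"
fi
```

If this reports that the headers are present and nvidia-smi still fails, the cause is probably something else, such as the missing kernel modules described in the next section.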
CUDA reinstallation problem
I had multiple versions of CUDA on my machine. Some were installed manually and some through apt-get. I cleared everything out and reinstalled CUDA.
Then I got the same error again:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
In this case the underlying problem was that the kernel modules for the driver were missing. To see which drivers are available, I had to run
ubuntu-drivers list --gpgpu
(sudo not required)
This then produces a list like:
nvidia-driver-535-open, (kernel modules provided by linux-modules-nvidia-535-open-generic)
nvidia-driver-450-server, (kernel modules provided by linux-modules-nvidia-450-server-generic)
nvidia-driver-465, (kernel modules provided by nvidia-dkms-465)
nvidia-driver-510, (kernel modules provided by nvidia-dkms-510)
nvidia-driver-525-server, (kernel modules provided by linux-modules-nvidia-525-server-generic)
nvidia-driver-515, (kernel modules provided by nvidia-dkms-515)
nvidia-driver-530, (kernel modules provided by nvidia-dkms-530)
nvidia-driver-418-server, (kernel modules provided by linux-modules-nvidia-418-server-generic)
nvidia-driver-470-server, (kernel modules provided by linux-modules-nvidia-470-server-generic)
nvidia-driver-525, (kernel modules provided by linux-modules-nvidia-525-generic)
nvidia-driver-535, (kernel modules provided by linux-modules-nvidia-535-generic)
nvidia-driver-450, (kernel modules provided by nvidia-dkms-450)
nvidia-driver-470, (kernel modules provided by linux-modules-nvidia-470-generic)
nvidia-driver-460, (kernel modules provided by nvidia-dkms-460)
nvidia-driver-455, (kernel modules provided by nvidia-dkms-455)
nvidia-driver-520, (kernel modules provided by nvidia-dkms-520)
nvidia-driver-525-open, (kernel modules provided by linux-modules-nvidia-525-open-generic)
nvidia-driver-495, (kernel modules provided by nvidia-dkms-495)
The missing package is the one in parentheses on the right-hand side. Take the highest driver number (535 in this case) and install the kernel modules package for one of its variants. Given the above:
sudo apt-get install linux-modules-nvidia-535-open-generic
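Picking the package out of that list by eye is error-prone, so here is a rough sketch that does it with sed. It assumes the output keeps the exact "nvidia-driver-N, (kernel modules provided by PKG)" shape shown above. The function name modules_for_newest and its suffix argument are my own inventions, not part of ubuntu-drivers.

```shell
# Hypothetical helper: read `ubuntu-drivers list --gpgpu` style lines on stdin and
# print the kernel modules package for the highest-numbered driver of one flavour.
# $1 is the flavour suffix: "" for the plain driver, "-open" or "-server" otherwise.
modules_for_newest() {
    suffix="$1"
    # Keep only lines of the requested flavour, rewritten as "<number> <package>".
    sed -n "s/^nvidia-driver-\([0-9][0-9]*\)${suffix}, (kernel modules provided by \(.*\))\$/\1 \2/p" |
        sort -n |        # order by driver number
        tail -n 1 |      # keep the highest
        cut -d' ' -f2    # drop the number, leaving the package name
}

# Example use on a real machine (not run here):
#   pkg=$(ubuntu-drivers list --gpgpu | modules_for_newest "-open")
#   sudo apt-get install "$pkg"
```

With the list above and the "-open" suffix, this selects linux-modules-nvidia-535-open-generic, which matches the package installed by hand. Note that for some drivers the right-hand side is an nvidia-dkms package rather than a linux-modules one, so it is worth checking the printed name before installing it.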