My deep learning machine has ended up with two repositories for cuda dependencies and cuda is periodically failing. To clean this up I want to purge all packages and configuration related to cuda and then reinstall from scratch.
Purging
The purging comes in three stages: the packages related to cuda and nvidia, then the custom sources, and finally the keys.
Purging Packages
The packages that are installed on the system can be found with dpkg -l. This lists the known packages one per line, which makes the output easy to process. For example:
➜ dpkg -l | grep nvidia
ii libnvidia-cfg1-535:amd64 535.113.01-0ubuntu0.20.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-535 535.104.12-0ubuntu1 all Shared files used by the NVIDIA libraries
rc libnvidia-compute-450:amd64 450.119.04-0ubuntu1 amd64 NVIDIA libcompute package
...
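The first column is the package state: ii means installed, while rc means removed with its configuration files left behind. The cuda packages have either cuda or nvidia in the name, and the package name is the second field of each line. We can extract it using awk '{ print $2 }', which prints the second whitespace-separated field: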
➜ dpkg -l | grep nvidia | awk '{ print $2 }'
libnvidia-cfg1-535:amd64
libnvidia-common-535
libnvidia-compute-450:amd64
...
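One caveat is that grep matches anywhere in the line, so a package whose description mentions nvidia would be picked up too. A minimal sketch of a stricter filter that only looks at the name column follows. The function name cuda_package_names and the example-tool line are made up for illustration, and the sample text stands in for real dpkg -l output.

```shell
# Hypothetical helper: print package names (field 2) where the name itself,
# not the description, contains "cuda" or "nvidia".
cuda_package_names() {
    awk '$2 ~ /cuda|nvidia/ { print $2 }'
}

# Sample lines standing in for `dpkg -l` output. The example-tool entry is
# invented to show a description match that should be skipped.
sample='ii libnvidia-cfg1-535:amd64 535.113.01-0ubuntu0.20.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii example-tool 1.0 all Hypothetical package whose description mentions nvidia
rc libnvidia-compute-450:amd64 450.119.04-0ubuntu1 amd64 NVIDIA libcompute package'

printf '%s\n' "$sample" | cuda_package_names
```

On a real machine the sample would be replaced by piping dpkg -l into the function.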
With this we can then use xargs, which takes the standard input and appends it to the command:
➜ dpkg -l | grep nvidia | awk '{ print $2 }' | xargs sudo apt-get remove --purge
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED
libnvidia-cfg1-535* libnvidia-common-535* libnvidia-compute-450* ...
Because xargs is consuming the pipe, apt-get has no terminal on its standard input, so it cannot ask you to confirm the operation and it aborts. This gives you a chance to review the packages that will be removed. Packages that are not installed will be ignored.
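To see what xargs builds without touching the system at all, a small sketch: putting echo in front of the command makes xargs print the command line instead of running it. The function name preview_purge is hypothetical, and the package names are the ones from the listing above.

```shell
# Hypothetical helper: print, rather than run, the command that xargs
# assembles from the names on its standard input.
preview_purge() {
    xargs echo sudo apt-get remove --purge
}

printf '%s\n' libnvidia-cfg1-535:amd64 libnvidia-common-535 libnvidia-compute-450:amd64 |
    preview_purge
```

apt-get also accepts --simulate, which reports what would be removed without changing anything; that may be a more direct way to review the list.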
To actually remove them we just add the --yes option to apt-get. That gives us two commands to purge the cuda packages:
➜ dpkg -l | grep cuda | awk '{ print $2 }' | xargs sudo apt-get remove --purge --yes
➜ dpkg -l | grep nvidia | awk '{ print $2 }' | xargs sudo apt-get remove --purge --yes
After this there may be packages that were installed to support cuda that are no longer required. We can remove them with:
sudo apt-get autoremove
Purging Sources
The cuda installation instructions get you to write a source to /etc/apt/sources.list.d/. Searching this directory and the base source list finds the entries related to cuda:
grep nvidia /etc/apt/sources.list /etc/apt/sources.list.d/*
After removing these files you need to refresh your apt cache:
sudo apt-get update
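If you only want the names of the files to delete, grep -l prints just the matching file names. The sketch below tries this on a throwaway directory rather than the real /etc/apt. The function name nvidia_source_files, the file names, and the file contents are all made up for the demonstration.

```shell
# Hypothetical helper: list the files in a directory that mention nvidia.
# -l prints only the names of matching files.
nvidia_source_files() {
    grep -l nvidia "$1"/*
}

# Build a throwaway directory with one cuda source and one unrelated source.
demo_dir=$(mktemp -d)
echo 'deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /' > "$demo_dir/cuda.list"
echo 'deb http://archive.ubuntu.com/ubuntu jammy main' > "$demo_dir/ubuntu.list"

# Prints only the path of cuda.list.
nvidia_source_files "$demo_dir"
rm -r "$demo_dir"
```

Pointing the function at /etc/apt/sources.list.d would list the candidate files to review before deleting anything.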
Purging Keyrings
The final part is to purge the keys from the apt keyring. To find it we first list the keys:
➜ sudo apt-key list
/etc/apt/trusted.gpg
--------------------
...
pub rsa4096 2017-09-28 [SCE]
C95B 321B 61E8 8C18 09C4 F759 DDCA E044 F796 ECB0
uid [ unknown] NVIDIA CORPORATION (Open Source Projects) <cudatools@nvidia.com>
...
Here we can see the NVIDIA Corporation key. The long hexadecimal number is the fingerprint of the key, and we can refer to the key by its last eight digits, F796ECB0. The list output does not make this format obvious.
To remove this key we then run:
➜ sudo apt-key del F796ECB0
OK
Running sudo apt-key list again should show that the key has been deleted.
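The short id passed to apt-key del is just the tail of the fingerprint with the spaces removed. A minimal sketch of that derivation, using the fingerprint from the listing above (the function name short_key_id is hypothetical):

```shell
# Hypothetical helper: derive the 8-digit short key id from a fingerprint by
# dropping the spaces and keeping the last 8 characters.
short_key_id() {
    printf '%s' "$1" | tr -d ' ' | tail -c 8
}

# Prints F796ECB0 (the trailing echo adds the newline).
short_key_id 'C95B 321B 61E8 8C18 09C4 F759 DDCA E044 F796 ECB0'
echo
```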
Checking Removal
We can check that cuda is no longer available by updating the apt cache and then trying to install it, which should fail:
➜ sudo apt-get update
...
➜ sudo apt-get install cuda
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package cuda
We can check that nvidia-smi is unavailable:
➜ nvidia-smi
zsh: command not found: nvidia-smi
We can check that torch reports cuda as unavailable:
➜ poetry run python -c 'import torch; print(torch.cuda.is_available())'
False
(I’m doing this in a virtual environment; pytorch is not installed globally.)
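The command check can also be scripted so that it does not depend on reading a shell error message. This is only a sketch: command_status is a made-up name, and command -v is the standard way to ask the shell whether a command can be found.

```shell
# Hypothetical helper: report whether a command can be found on PATH.
command_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: present"
    else
        echo "$1: absent"
    fi
}

# After a successful purge the first line should say "absent".
command_status nvidia-smi
command_status sh
```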
Reinstalling
The cuda installation instructions cover ubuntu. The basic steps are:
Install the linux headers:
➜ sudo apt-get install linux-headers-$(uname -r)
Check that gcc is installed and working:
➜ gcc --version
Install the keyring using the network installer, with $distro/$arch set to ubuntu2204/x86_64:
➜ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
➜ sudo dpkg -i cuda-keyring_1.1-1_all.deb
Then we just have to update and install:
➜ sudo apt-get update
➜ sudo apt-get install cuda-toolkit nvidia-driver-545 nvidia-utils-545
After this a restart is required to load the driver correctly.
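The $distro/$arch part of the keyring URL can be assembled rather than typed. The sketch below assumes that other releases follow the same pattern as ubuntu2204/x86_64 and that the keyring package is still version 1.1-1; both are assumptions worth checking against the installation instructions. The function name keyring_url is hypothetical.

```shell
# Hypothetical helper: build the keyring URL from a distro id, a release
# number, and an architecture, following the ubuntu2204/x86_64 pattern.
# Assumes the keyring package version is still 1.1-1.
keyring_url() {
    distro="$1$(printf '%s' "$2" | tr -d '.')"
    echo "https://developer.download.nvidia.com/compute/cuda/repos/$distro/$3/cuda-keyring_1.1-1_all.deb"
}

# Reproduces the URL used above.
keyring_url ubuntu 22.04 x86_64
```

On a real machine the arguments could come from the ID and VERSION_ID values in /etc/os-release and from uname -m.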
Checking the Installation
We can check the installation using the command line tool nvidia-smi:
➜ nvidia-smi
Sat Sep 30 20:57:02 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA TITAN RTX               On  | 00000000:01:00.0 Off |                  N/A |
| 41%   61C    P0              78W / 280W |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
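If you want to compare the loaded driver against the package you installed, the version can be pulled out of the header line. This is a rough sketch that parses the text format shown above, which may change between nvidia-smi releases; the function name driver_version is made up.

```shell
# Hypothetical helper: extract the driver version from nvidia-smi's header.
driver_version() {
    sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p'
}

# Header line copied from the output above; prints 535.104.12.
echo '| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |' | driver_version
```

On a machine with the driver loaded, nvidia-smi | driver_version does the same thing, and nvidia-smi --query-gpu=driver_version --format=csv,noheader should give the value directly without any parsing.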
We can also check that we can use cuda within pytorch:
➜ poetry run python -c '
import torch;
print(torch.cuda.is_available());
print(torch.tensor([1,2,3], device="cuda"))'
True
tensor([1, 2, 3], device='cuda:0')
Looks good!
I really need to install that other graphics card that my mate gave me. Would be good.